GrandCC: an Interpretable Graph Attention Networks-based Framework for Cancer Subtype Classification

In this project, we develop a GrandCC model based on a graph attention network to perform molecular subtyping.

The model was validated on two different cancers, including the CMS subtyping of Colorectal cancer (CRC) and the PAM50 classification of breast cancer.

The model was deployed as an easy-to-use online tool, please try the website https://gene.ai.tencent.com/MolecularSubtyping.

Requirements

Python 3.6
PyTorch >= 1.5.0
torch_geometric 1.6.2
numpy
pandas
scipy
sklearn
opencv
random
statsmodels

Task 1. CRCSC subtyping：

The GrandCC model is trained on the TCGA dataset and tested on other 11 datasets.

Usage：

Validate the model with the 11 test dataset used in our paper, including gse13067, gse13294, gse14333, gse17536, gse20916, gse2109, gse35896, gse37892, gse39582, kfsyscc, petacc3. You can obtain the same results as our paper with the following options:
```
cd CRC_GrandCC_opensource
python3 test_model.py --exp GNN_sim_graph （The pretrained model is saved at: save_models/GNN_sim_graph.pt）
```
If you want to evaluate our model with your own data, please put your data file in the 'CRC_GrandCC_opensource/ external_dataset' folder (must be in the "txt" format, please follow the details in the below "Data Preparation"). Run the following:
```
cd CRC_GrandCC_opensource
python3 external_test.py --exp GNN_sim_graph --external_file name of your file
```
** The operations in our codes are as follows:
- The data uploaded by the user must be in "txt" file, and the genes must be represented with entrez ids.
- Our model will check the overlap of the genes in your uploaded data and our training dataset.
- If the overlap ratio is less than 70%, the model will exit without making predictions.
- If the overlap ratio is larger than 70%, the model will match the order of the genes and impute the missing genes with all 0, and make predictions using the imputed data.

Data Description：

The similarity graph is saved at：./similarity_graphs/TCGA_sim_graph.csv in the following format:

0 1 2 3 ... Num_edge - 1 Num_edge

0 0 | 1 | 1 | 1 | ... | 5972 | 5972 1 0 | 1 | 24 | 43 | ... | 5120 | 5972

The first row denotes the index of the connected edges. The second and third rows indicate the indices of the 
two nodes connected by the edge.

The dataset is saved at：./CRC_dataset.

The data structure is：

├── gse13067.txt                       
├── gse13294.txt
├── gse14333.txt
├── gse17536.txt
├── gse20916.txt
├── gse2109.txt 
├── gse35896.txt
├── gse37892.txt
├── gse39582.txt
├── kfsyscc.txt
├── petacc3.txt
├── tcga.txt
├── labels.txt

The gene expression data of each dataset is contained in a txt file in the following format. Each row represents the gene expression of one patient, while each column denotes the expression of one gene (Entrez ID):

sample	153396	1378	5655	...	5345	6894
GSM523257	5.15212166212432	2.88664048871183	5.3760491573249	...	6.2371587886039	6.4355759773902
GSM523258	4.94997883745515	2.37108757063653	5.0852982410145	...	6.5261487210994	6.6052447056984
...	...	...	...	...	...	...
GSM523263	5.22185205784889	2.20199502833056	5.3918173202651	...	6.4885185455419	6.9666661456165

** If you want to evaluate our model with your own data, please prepare the data in this form and make sure the genes are represented in the Entrez ID.

The labels of all data are contained in the ./CRC_dataset/labels.txt file, with the column 'CMS_network' used to train and evaluate the model performance. The file is in the following format:

sample cohort DeepCC DeepCC_pre CMS_network

TCGA-A6-6653 tcga CMS1 CMS1 CMS1

TCGA-A6-A56B tcga CMS4 CMS4 CMS4

... ... ... ... ...

TCGA-AD-6963 tcga CMS2 CMS2 CMS2

sample	cohort	DeepCC	DeepCC_pre	CMS_network
TCGA-A6-6653	tcga	CMS1	CMS1	CMS1
TCGA-A6-A56B	tcga	CMS4	CMS4	CMS4
...	...	...	...	...
TCGA-AD-6963	tcga	CMS2	CMS2	CMS2

Model Prediction：

The predictions of our model are saved in the "./results" folder, in the "csv" form as follows:

sample_id	CMS1_prob	CMS2_prob	CMS3_prob	CMS4_prob	GNN_pred	CMS network
0	0.11788527	0.6282107	0.120226026	0.13367797	1	1
1	0.09888291	0.66571975	0.13151056	0.10388676	1	1
...	...	...	...	...	...	...
N-1	0.38062993	0.10873932	0.07082967	0.4398011	3	0
N	0.13789913	0.33324537	0.35867894	0.17017654	2	2

Each row represents the prediction of a patient.
The 2-5 columns are the predicted probabilities for the four CMS subtypes.
The column "GNN_pred" is the class predicted by our model.
The column "CMS network" is the ground-truth class.
For the prediction of your own data, the result file will not include the last column "CMS network" because we don't require the ground-truth labels from you.

Evaluation：

Please refer to model_evaluation.Rmd file to evaluate the performance.

Task 2: Breast cancer PAM50 classification：

The GrandCC model is trained on the TCGA dataset and tested on other 5 datasets.

Usage：

Validate the model with the 5 test dataset used in our paper, including NKI, TRANSBIG, UNT, UPP, VDX. You can obtain the same results as our paper with the following options:
```
cd BRCA_GrandCC_opensource
python3 test_model.py --which_graph similarity --exp GNN_sim_graph --num_nodes 3000
```
If you want to evaluate our model with your own data, please put your data file in the 'BRCA_GrandCC_opensource/ external_dataset' folder (must be in the "txt" format, please follow the details in the "Data Preparation"). Run the following:
```
cd BRCA_GrandCC_opensource
python3 external_test.py --which_graph similarity --exp GNN_sim_graph --num_nodes 3000 --external_file name of your file
```
** The operations in our codes are as follows:
- The data uploaded by the user must be in "txt" file, and the genes must be represented with entrez ids.
- Our model will check the overlap of the genes in your uploaded data and our training dataset.
- If the overlap ratio is less than 70%, the model will exit without making predictions.
- If the overlap ratio is larger than 70%, the model will match the order of the genes and impute the missing genes with all 0, and make predictions using the imputed data.

Data Description：

The dataset is saved at：./BRCA_data_label/intersect_genes.
The similarity graph is saved at：./similarity_graphs/TCGA_sim_graph.csv.
The file format of the gene expression data and labels are the same as the previously introduced BRCA datasets. If you want to evaluate our model with your own data, please make sure to prepare the data in the correct form.

Model Prediction：

The predictions of our model are saved in the "./results" folder, in the "csv" form as follows:

sample_id	Basal_prob	LumA_prob	LumB_prob	Her2_prob	Normal_prob	GNN_pred	label
NKI_4	0.05495929	0.02721681	0.47734076	0.33200058	0.10848256	2	2
NKI_6	0.000952676	0.79077834	0.18619132	0.015031756	0.007045906	1	1
...	...	...	...	...	...	...	...
NKI_8	0.8464114	0.004770978	0.037902106	0.0896943	0.021221258	0	0
NKI_11	0.060146123	0.13140255	0.47476345	0.30161282	0.03207499	2	3

Each row represents the prediction of a patient.
The 2-6 columns are the predicted probabilities for the five PAM50 subtypes.
The column "GNN_pred" is the class predicted by our model.
The column "label" is the ground-truth class.
For the prediction of your own data, the result file will not include the last column "label" because we don't require the ground-truth labels from you.

Disclaimer

This tool is for research purpose and not approved for clinical use.

This is not an official Tencent product.

Copyright

This tool is developed in Tencent AI Lab.

yfzon/GrandCC

GrandCC: an Interpretable Graph Attention Networks-based Framework for Cancer Subtype Classification

Requirements

Task 1. CRCSC subtyping：

Usage：

Data Description：

Model Prediction：

Evaluation：

Task 2: Breast cancer PAM50 classification：

Usage：

Data Description：

Model Prediction：

Disclaimer

Copyright