This is a TensorFlow implementation of scGCN for leveraging labeled data and transferring labels across different single-cell datasets.
Single-cell omics represent the fastest-growing genomics data type in the literature and the public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of the single-cell omics. The current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Benchmarked with other label transfer methods on different single cell omics datasets, scGCN has consistently demonstrated superior accuracy on leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow and provided here.
- setuptools >= 40.6.3
- numpy >= 1.15.4
- tensorflow >= 1.15.0
- networkx >= 2.2
- scipy >= 1.1.0
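To confirm that an environment meets the minimum versions listed above before installing, a small check along the following lines may help. This is only a sketch: `parse_version` and `find_problems` are hypothetical helper names, not part of scGCN, and the version parsing is deliberately simplistic (it compares numeric components only and stops at pre-release suffixes such as `rc1`).

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
REQUIREMENTS = {
    "setuptools": "40.6.3",
    "numpy": "1.15.4",
    "tensorflow": "1.15.0",
    "networkx": "2.2",
    "scipy": "1.1.0",
}


def parse_version(text):
    """Turn '1.15.4' into (1, 15, 4); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def find_problems(requirements, lookup=metadata.version):
    """Return human-readable problems; an empty list means all is well.

    `lookup` maps a package name to its installed version string and
    defaults to importlib.metadata.version (Python 3.8+).
    """
    problems = []
    for name, minimum in requirements.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems


if __name__ == "__main__":
    for line in find_problems(REQUIREMENTS) or ["all requirements satisfied"]:
        print(line)
```

For strict version semantics (pre-releases, local versions), a dedicated parser such as the one in the `packaging` library would be a better fit than the tuple comparison used here.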
Download scGCN:
git clone https://github.com/QSong-github/scGCN
Install requirements and scGCN:
python setup.py install
Installation typically takes less than 10 seconds and has been tested on macOS and Linux.
Load the example data using the data_preprocess.R script. The example data include the Mouse (reference) and Human (query) data from the GSE84133 dataset. The reference dataset contains 1,841 cells, and the query dataset contains more cells (N=7,264) and 12,182 genes.
cd scGCN
Rscript data_preprocess.R # load example data
python train.py # run scGCN
All output is written to the output_log.txt file, with performance reported at the bottom. We also provide the Seurat performance on this reference-query set (as in Figure 4), which can be obtained by running:
Rscript Seurat_result.R
When using your own data, you have to provide
- the raw data matrix of reference data and cell labels
- the raw data matrix of query data
The output files with scGCN predicted labels will be stored in the results folder.
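Before running the workflow on your own data, it can be useful to sanity-check the inputs. The sketch below assumes, as is common for label transfer but not stated in this README, that the reference and query matrices should share a set of genes and that there is exactly one label per reference cell. The function name `check_inputs` and the way the inputs are represented (plain lists of names) are hypothetical and are not part of scGCN.

```python
def check_inputs(ref_genes, ref_cells, ref_labels, query_genes):
    """Validate reference/query inputs and return their shared genes.

    ref_genes   -- gene names of the reference matrix
    ref_cells   -- cell identifiers of the reference matrix
    ref_labels  -- one cell-type label per reference cell
    query_genes -- gene names of the query matrix

    The shared genes are returned in reference order.
    """
    if len(ref_labels) != len(ref_cells):
        raise ValueError(
            f"{len(ref_cells)} reference cells but {len(ref_labels)} labels"
        )
    if len(set(ref_genes)) != len(ref_genes):
        raise ValueError("duplicate gene names in the reference data")

    query_set = set(query_genes)
    shared = [gene for gene in ref_genes if gene in query_set]
    if not shared:
        raise ValueError("reference and query share no genes")
    return shared


if __name__ == "__main__":
    # Toy data for illustration only.
    shared = check_inputs(
        ref_genes=["GeneA", "GeneB", "GeneC"],
        ref_cells=["cell1", "cell2"],
        ref_labels=["alpha", "beta"],
        query_genes=["GeneC", "GeneA", "GeneD"],
    )
    print(f"{len(shared)} shared genes: {shared}")
```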
We also provide other GCN models, including GAT (Veličković et al., ICLR 2018), HyperGCN (Chami et al., NIPS 2019) and GWNN (Xu et al., ICLR 2019), for optional use.
For query data containing cell types that do not appear in the reference data, scGCN provides a screening step based on two statistical metrics, the entropy score and the enrichment score. Query cells with higher entropy and lower enrichment should be assigned as unknown cells. To detect unknown cells, set check_unknown=TRUE in the function 'save_processed_data'.
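To give an intuition for the entropy part of this screening, here is a minimal sketch that computes a normalized Shannon entropy over a cell's predicted class probabilities and flags high-entropy cells. It is not the scGCN implementation: the exact entropy formula used by scGCN, the enrichment score (not defined in this README, so not sketched), the function names, and the `threshold` value are all assumptions made for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1].

    0 means the prediction is concentrated on one class; 1 means the
    probabilities are uniform across all classes.
    """
    n_classes = len(probs)
    if n_classes < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n_classes)


def flag_unknown(prob_rows, threshold=0.8):
    """Flag cells whose normalized entropy exceeds `threshold`.

    `threshold` is a hypothetical cut-off chosen for this example only.
    """
    return [normalized_entropy(row) > threshold for row in prob_rows]


if __name__ == "__main__":
    # Toy predicted probabilities for two query cells over three cell types.
    predictions = [
        [0.98, 0.01, 0.01],  # confident prediction
        [0.34, 0.33, 0.33],  # nearly uniform, candidate unknown cell
    ]
    for row, is_unknown in zip(predictions, flag_unknown(predictions)):
        print(row, "unknown" if is_unknown else "assigned")
```

In this sketch only entropy decides the flag; the screening described above combines it with the enrichment score, so a cell would need to look uncertain on both metrics.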
The above scripts can reproduce the quantitative results in our manuscript based on our provided data.
Please cite our paper and the related GCN papers if you use this code in your own work:
Song, Q., Su, J., & Zhang, W. (2020). scGCN: a Graph Convolutional Networks Algorithm for Knowledge Transfer in Single Cell Omics. bioRxiv.