scIAE is an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data. It can be used to perform feature extraction, identify cell types, and predict disease status.
Given the gene expression matrix and labels (cell type annotation or disease status) of a training set and the gene expression matrix of a testing set, scIAE returns the predicted labels of the testing set. If the true labels of the testing set are given, the evaluation criteria (ACC, MeanF1, and MedF1) can be calculated to assess the classification performance of scIAE. If the number of base classifiers is set to one, the dimensionality reduction result of the testing set can also be obtained.
scIAE corresponds to the following paper:
Yin, Q., Wang, Y., Guan, J., Ji, G. scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data, Brief Bioinform, 2022, 23(1): bbab508. https://doi.org/10.1093/bib/bbab508
Depends:
R (>= 4.0.2)
Python (>= 3.8.3)
Requirements:
keras (>= 2.4.3)
tensorflow (>= 2.3.1)
library("keras")
library("parallel")
library("caret")
library("e1071")
library("kknn")
library("rpart")
library("rBayesianOptimization")
Run main.R. The parameters can be changed as shown below.
The datasets analyzed in the paper are available at https://doi.org/10.5281/zenodo.5168428. Users who supply their own datasets must make sure that the order of cells in the gene expression matrix matches the order of the labels. Rows correspond to cells, and columns correspond to genes.
train_data <- as.matrix(read.csv("pancreas_smartseq_data.csv", row.names = 1)) #gene expression matrix of training set (matrix or data.frame, not null)
train_info <- read.csv("pancreas_smartseq_label.csv", row.names = 1)[, 1] #label of training set (character or integer, not null)
test_data <- as.matrix(read.csv("pancreas_celseq_data.csv", row.names = 1)) #gene expression matrix of testing set (matrix or data.frame, not null)
test_info <- read.csv("pancreas_celseq_label.csv", row.names = 1)[, 1] #label of testing set (character or integer, should be provided when calculating ACC, MeanF1, and MedF1)
> train_data[1:5,1:5]
GCG PPY REG1A INS SST
AZ_A10 2.691009 13.919472 2.770087 5.973918 15.554698
AZ_A11 16.082961 2.459152 3.445946 2.201366 4.281374
AZ_A12 5.319866 4.941085 3.733110 1.017788 15.036942
AZ_A2 4.435744 15.090921 3.333762 2.358494 3.447319
AZ_A5 8.230754 6.415130 6.945927 2.383204 6.895719
> head(train_info)
[1] "delta" "alpha" "delta" "gamma" "ductal" "alpha"
> test_data[1:5,1:5]
REG1A INS GCG CHGB TM4SF4
D28.1_1 5.314580 0 10.642062 8.069391 6.143180
D28.1_13 7.106880 0 5.975432 0.000000 0.000000
D28.1_15 5.314580 0 10.642062 8.207436 6.979185
D28.1_17 4.518002 0 10.642062 5.725955 5.657918
D28.1_2 4.216514 0 6.061108 1.588734 1.588734
> head(test_info)
[1] "alpha" "ductal" "alpha" "alpha" "endothelial" "endothelial"
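For user-supplied data, the requirement that cells appear in the same order in the expression matrix and in the labels can be checked along the following lines. This is only a sketch on toy data: the object names (expr, labels, train_info_toy) and the assumption that the label file carries cell names as row names are illustrative and not part of scIAE.

```r
# Toy stand-ins for a user-supplied dataset (hypothetical values).
set.seed(1)
expr <- matrix(rexp(12), nrow = 4, ncol = 3,
               dimnames = list(paste0("cell", 1:4), c("GCG", "INS", "SST")))
labels <- data.frame(cell_type = c("alpha", "beta", "delta", "alpha"),
                     row.names = c("cell3", "cell1", "cell4", "cell2"),
                     stringsAsFactors = FALSE)

# Rows must be cells and columns genes; use t() first if the file is genes x cells.
stopifnot(nrow(expr) == nrow(labels))
stopifnot(setequal(rownames(expr), rownames(labels)))

# Reorder the labels so that they follow the row order of the expression matrix.
labels <- labels[rownames(expr), , drop = FALSE]
train_info_toy <- labels$cell_type
```

After the reordering, element i of train_info_toy describes row i of expr, which is the layout that train_data and train_info are expected to have.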
get_intersection() extracts the genes shared by the training set and the testing set. To use it, the gene expression matrices of both sets must have gene names as column names.
> dim(train_data)
[1] 2166 10698
> dim(test_data)
[1] 2122 6878
> data_intersection <-get_intersection(train_data,test_data)
> train_data <- data_intersection[[1]]
> test_data <- data_intersection[[2]]
> dim(train_data)
[1] 2166 4943
> dim(test_data)
[1] 2122 4943
Note that the data used here come from the Hemberg lab and differ from the datasets we uploaded to Zenodo. The Zenodo datasets were pre-processed, which included extracting the overlapping genes between the training set and the testing set.
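The gene-overlap step can be pictured with the minimal sketch below. It is a hypothetical re-implementation (get_intersection_sketch is not a function of the package), written under the assumption that overlap is determined by matching column names; the actual get_intersection() may differ in details such as gene ordering.

```r
# Hypothetical sketch; the package's get_intersection() may behave differently.
get_intersection_sketch <- function(train, test) {
  # Both matrices need gene names as column names.
  stopifnot(!is.null(colnames(train)), !is.null(colnames(test)))
  shared <- intersect(colnames(train), colnames(test))
  # Subset both matrices to the shared genes, in the same column order.
  list(train[, shared, drop = FALSE], test[, shared, drop = FALSE])
}

# Toy matrices: rows are cells, columns are genes.
train_toy <- matrix(1:8, nrow = 2,
                    dimnames = list(c("a1", "a2"), c("GCG", "PPY", "INS", "SST")))
test_toy <- matrix(1:6, nrow = 2,
                   dimnames = list(c("b1", "b2"), c("INS", "GCG", "CHGB")))

out <- get_intersection_sketch(train_toy, test_toy)
colnames(out[[1]])
colnames(out[[2]])
```

Both returned matrices keep their own cells but share the same genes in the same column order, so a model trained on the first can be applied to the second.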
cross_validation() performs cross-validation to tune the parameters of scIAE, including the number of base classifiers, the denoising rate, lambda (the regularization parameter), the activation functions of the hidden and output layers, and the encoded dimensions of each stack. It can also tune the hyperparameters of the base classifiers: cost and gamma for SVM, the split criterion for DT, the number of neighbors for kNN, and the number of components for PLSDA. Its inputs are the parameter intervals supplied by the user, the training data with the corresponding labels, and the number of folds (default: 5). The function returns ACC, MeanF1, and MedF1 for each parameter combination, so users can choose the parameters that best suit their needs.
> cv_result <- cross_validation(train_data,
                                train_info,
                                t_interval = c(5,10,15),
                                denoising_rate_interval = c(0.1,0.2,0.3),
                                lambda_interval = c(1e-4,1e-5),
                                base_classifier = 'SVM',
                                cost_interval = c(8,16),
                                gamma_interval = c(1/500,1/1000))
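To get a feel for the cost of such a search, the sketch below enumerates the parameter combinations implied by the intervals above and builds a stratified fold assignment in base R. It assumes that every combination of the intervals is evaluated and that folds are stratified by label; neither detail is confirmed by this README, and assign_folds is a hypothetical helper, not part of scIAE.

```r
# All combinations of the intervals used in the call above.
grid <- expand.grid(t = c(5, 10, 15),
                    denoising_rate = c(0.1, 0.2, 0.3),
                    lambda = c(1e-4, 1e-5),
                    cost = c(8, 16),
                    gamma = c(1/500, 1/1000))
nrow(grid)  # 3 * 3 * 2 * 2 * 2 = 72 combinations

# Hypothetical helper: assign each cell to one of k folds, class by class,
# so that every fold contains a similar share of each label.
assign_folds <- function(labels, k = 5) {
  folds <- integer(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    folds[idx[sample.int(length(idx))]] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

set.seed(1)
toy_labels <- rep(c("alpha", "beta"), each = 10)
folds <- assign_folds(toy_labels, k = 5)
table(folds, toy_labels)
```

Under these assumptions, 72 combinations with 5 folds each already mean 360 training runs, so narrow intervals are advisable for a first pass.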
scIAE() returns the predicted results for the testing data. Its inputs are listed below.
train_data: gene expression matrix of training set (matrix or data.frame, not null)
train_info: label of training set (character or integer, not null)
test_data: gene expression matrix of testing set (matrix or data.frame, not null)
t: number of base classifiers (integer, default: 10)
denoising_rate: denoising rate in the input layer (numeric, default: 0.2)
lambda: L1 regularization parameter (numeric, default: 1e-5)
activation_hidden: activation function used in the hidden layer of each stack (in c('linear','sigmoid','tanh','relu','exponential','softmax'), default: 'sigmoid')
activation_output: activation function used in the output layer of each stack (in c('linear','sigmoid','tanh','relu','exponential','softmax'), default: 'sigmoid')
batch_size: batch size in training autoencoder (integer, default: 256)
learning_rate: learning rate in training autoencoder (numeric, default: 0.001)
epochs: epochs in training autoencoder (integer, default: 40)
encoded_1: encoded dimension of stack 1 (integer, default: 1024)
encoded_2: encoded dimension of stack 2 (integer, default: 128)
base_classifier: base classifier algorithm (in c('SVM','DT','kNN','PLSDA'), default: 'SVM')
verbose: if current ensemble is printed or not (logical, default: TRUE)
cost: cost of constraints violation if base_classifier is 'SVM' (numeric, default: 16)
gamma: parameter for radial basis if base_classifier is 'SVM' (numeric, default: 1/1000)
split: split rule if base_classifier is 'DT' (in c('gini','information'), default: 'information')
kNN_k: number of neighbors if base_classifier is 'kNN' (integer, default: 5)
n_components: number of components if base_classifier is 'PLSDA' (integer, default: 10)
unassigned: if the classifier gives 'unassigned' label or not (logical, default: FALSE)
unassigned_threshold: the probability threshold of giving 'unassigned' label (numeric, default: NA)
DR_output: if the dimensionality reduction result of testing set is returned or not (logical, default: TRUE)
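The interplay of t, unassigned, and unassigned_threshold can be illustrated with the sketch below. This README does not state how the base classifiers' predictions are combined or how the probability is computed, so the sketch makes two assumptions: predictions are combined by majority vote, and the fraction of votes for the winning label plays the role of the probability. combine_votes and its votes argument are hypothetical and not part of scIAE.

```r
# Hypothetical sketch of ensemble voting with an optional 'unassigned' label.
# votes: character matrix, one row per cell, one column per base classifier.
combine_votes <- function(votes, unassigned = FALSE, unassigned_threshold = NA) {
  unname(apply(votes, 1, function(v) {
    counts <- table(v)
    top <- names(counts)[which.max(counts)]  # ties go to the first label
    support <- max(counts) / length(v)       # assumed stand-in for probability
    if (unassigned && !is.na(unassigned_threshold) &&
        support < unassigned_threshold) {
      "unassigned"
    } else {
      top
    }
  }))
}

# Three cells, three base classifiers (t = 3).
votes <- rbind(c("alpha", "alpha", "alpha"),
               c("beta",  "delta", "beta"),
               c("alpha", "beta",  "delta"))

combine_votes(votes)
combine_votes(votes, unassigned = TRUE, unassigned_threshold = 0.5)
```

In the second call the third cell receives 'unassigned' because no label reaches half of the votes, whereas the first call always returns one of the known labels.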
Run scIAE() to obtain the predicted results.
> scIAE_output <- scIAE (train_data,train_info,test_data)
> pred_labels <- scIAE_output[['pred_labels']]
> head(pred_labels)
[1] "alpha" "ductal" "alpha" "alpha" "endothelial" "endothelial"
If t=1, the dimensionality reduction result of the testing set can also be obtained.
> scIAE_output <- scIAE (train_data,train_info,test_data,t=1)
> DR_result <- scIAE_output[['DR_result']]
> dim(DR_result)
[1] 2122 128
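One possible use of the reduced representation is a two-dimensional view of the cells. The sketch below runs PCA from base R on a random stand-in matrix with 128 columns; DR_toy is hypothetical data, and with real output the matrix DR_result from above would take its place.

```r
# Random stand-in for DR_result (cells x 128 encoded features); hypothetical data.
set.seed(42)
DR_toy <- matrix(rnorm(50 * 128), nrow = 50, ncol = 128)

# Project the encoded features onto the first two principal components.
pca <- prcomp(DR_toy, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]
dim(coords)

# With real data, cells could be colored by predicted label, for example:
# plot(coords, col = as.integer(factor(pred_labels)), pch = 16)
```

PCA is used here only because it ships with base R; any other embedding method could be applied to the encoded matrix in the same way.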
If test_info is provided, evaluate() calculates the evaluation criteria (ACC, MeanF1, and MedF1).
> true_labels <- test_info
> result <- evaluate(true_labels, pred_labels)
> print(result)
[1] 0.9161169 0.8339408 0.9531429
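For reference, the three criteria can be computed in base R roughly as follows. This sketch assumes that ACC is the fraction of correctly labeled cells and that MeanF1 and MedF1 are the mean and median of the per-class F1 scores over the classes present in the true labels. evaluate_sketch is a hypothetical helper, and the package's evaluate() may handle details such as absent classes or 'unassigned' predictions differently.

```r
# Hypothetical sketch; not the package's evaluate().
evaluate_sketch <- function(true_labels, pred_labels) {
  classes <- unique(true_labels)
  # Per-class F1 = 2*TP / (2*TP + FP + FN), set to 0 when the class is never hit.
  f1 <- sapply(classes, function(cls) {
    tp <- sum(pred_labels == cls & true_labels == cls)
    fp <- sum(pred_labels == cls & true_labels != cls)
    fn <- sum(pred_labels != cls & true_labels == cls)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  c(ACC = mean(pred_labels == true_labels),
    MeanF1 = mean(f1),
    MedF1 = median(f1))
}

# Toy example with three classes and two misclassified cells.
truth <- c("alpha", "alpha", "beta", "beta", "delta", "delta")
pred  <- c("alpha", "alpha", "beta", "alpha", "delta", "beta")
toy_result <- evaluate_sketch(truth, pred)
toy_result
```

In this toy case the per-class F1 scores are 0.8 (alpha), 0.5 (beta), and 2/3 (delta), so MedF1 is 2/3 while ACC is 4/6.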