scParser: A C++ repository from kai0511

scParser

scParser is an ensemble computational tool for interpretive single-cell RNA-seq data analysis. It decomposes variation from multiple biological conditions and cellular variation across bio-samples into shared low-rank latent spaces. This decomposition enables a range of downstream analyses. See the paper in the References section for details of the downstream analyses.

Dependencies

The R package of scParser has the following dependencies.

Rcpp (>= 1.0.9), RcppArmadillo, dplyr

Installation

scParser uses OpenMP to support parallel computing. To benefit from it, install OpenMP before installing the package. The package can be installed in either of the following two ways.

Install via devtools

install.packages("devtools") 
require(devtools)
install_github("kai0511/scParser")

Install locally

Download the zip file from GitHub, unzip it, and change into the resulting directory. Then install the package by running the following commands:

R CMD build .
R CMD INSTALL scParser_1.0.tar.gz

Usage

Data preparation

Here we use the data matrix (22,753 cells × 2,000 genes) of the DM dataset as a toy example for illustration. The data is placed in the data directory of this package.

require(scParser)

load("./data/E-HCAD-31_log_transformed_matrix.rds")  # load the example data as `dataset`; adjust the path to where the file is stored

head(dataset[,1:4])
                        disease_id donnor_id ENSG00000108849 ENSG00000157005
SRR5818088-AAAAAAAAAAAA          1         1        0.000000        0.000000
SRR5818088-AAAAAAAATAGG          1         1        0.000000        2.954825
SRR5818088-AAAAGACGAACG          1         1        7.223458        0.000000
SRR5818088-AAAAGGGCGAAC          1         1        0.000000        2.279514
SRR5818088-AAAAGGGGAACA          1         1        0.000000        0.000000
SRR5818088-AAAAGGGGAACC          1         1        0.000000        0.000000

end_idx <- 2  # The end index for covariate matrix
dataset[is.na(dataset)] <- 0 # cast NAs to zeros

# In the example data, there are 2 biological variables: diabetes diagnosis (disease_id) and donor (donnor_id).
confounders <- as.matrix(dataset[ ,1:end_idx])   # matrix for biological variables
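Because the confounder matrix is used for index lookups, labels stored as strings need to be converted to positive integers first. The following is a minimal sketch in base R, assuming a hypothetical metadata table `meta` with character columns `disease` and `donor`; these names and the helper `to_index` are illustrative and not part of the package.

```r
# Hypothetical per-cell metadata; column names and values are illustrative only.
meta <- data.frame(
  disease = c("control", "T2D", "T2D", "control"),
  donor   = c("d1", "d2", "d3", "d1"),
  stringsAsFactors = FALSE
)

# Map each distinct label to a positive integer code (1, 2, ...).
to_index <- function(x) as.integer(factor(x))

confounders_example <- cbind(
  disease_id = to_index(meta$disease),
  donnor_id  = to_index(meta$donor)
)

# Every entry should be a positive integer so it can serve as an index.
stopifnot(is.integer(confounders_example), all(confounders_example >= 1L))
```

Which label receives which code depends on the factor level order, so it is worth keeping the mapping if the codes need to be translated back later.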

Create scParser object

object <- scParser(as.matrix(dataset[,-c(1,2)]), confounders, split_ratio = 0.1, global_tol = 1e-8, sub_tol = 1e-5, tuning_iter = 30)

It takes the following arguments:

data: A log-transformed expression data matrix. For a data matrix with a large sample size, it is recommended to sample a proportion of observations across bio-samples for training, which accelerates model selection.
confounder: A confounder matrix. The elements of the matrix are used as indices to extract the corresponding latent representations, so they must be integers greater than 0.
split_ratio: the proportion of elements in the data matrix used as the test set for hyperparameter tuning. The default value is 0.1.
global_tol: the global convergence tolerance for scParser. Note that scParser checks convergence every 10 iterations. The default value is 1e-8.
sub_tol: the convergence criterion for the elastic net problems. Its impact on the global convergence rate is small. The default value is 1e-5.
tuning_iter: the number of iterations to run for each hyperparameter combination tried. In practice, 20 or 30 iterations per combination work well. The default value is 30.
max_iter: the maximum number of iterations. Once it is reached, iteration terminates even if the global convergence criterion has not been met. The default value is 10000.
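To make the role of split_ratio concrete, the sketch below draws a random train/test indicator over the elements of a small matrix, in the spirit of the train_indicator element of the fitted object. How scParser draws its split internally is not described in this README, so uniform sampling of individual entries is an assumption, and `make_train_indicator` is a hypothetical helper.

```r
# Hypothetical helper: mark a fraction `split_ratio` of entries as test (0)
# and the remaining entries as train (1). Uniform sampling is an assumption.
make_train_indicator <- function(n_row, n_col, split_ratio = 0.1, seed = 1) {
  set.seed(seed)
  n_total <- n_row * n_col
  n_test <- round(split_ratio * n_total)
  indicator <- matrix(1L, nrow = n_row, ncol = n_col)
  indicator[sample.int(n_total, n_test)] <- 0L
  indicator
}

train_indicator <- make_train_indicator(100, 50, split_ratio = 0.1)
mean(train_indicator == 0)  # fraction of held-out entries
```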

Tune hyperparameters

object <- tune(object, latent_rank = as.integer(seq(10, 30, by = 2)), lambda1 = c(0.1, 1, 10, 20, 50, 100), lambda2 = c(0.01, 0.1, 0.2, 0.4, 0.9))

It has the following arguments:

object: An scParser object created with the above arguments.
latent_rank: An integer vector from which the rank of the latent dimension is chosen.
cfd_rank (optional): An integer vector from which the rank of the latent dimension for modeling variation from biological conditions is chosen. If not supplied, we assume that $K_1 = K_2$. See the reference for details.
lambda1: A numeric vector from which the tuning parameter lambda1 is selected. $\lambda_1$ controls the penalty on the latent representations of biological conditions.
lambda2: A numeric vector from which the tuning parameter lambda2 is selected. $\lambda_2$ controls the penalty on the latent representations of cells.
alpha (optional): A numeric vector from which the tuning parameter alpha is selected. By default, $\alpha$ is 1.
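To get a feel for the size of the search space, the candidate vectors from the call above can be expanded into a grid. This is only a sketch: whether tune visits every combination is an assumption, and `grid` is an illustrative name.

```r
# Candidate values taken from the tune() call above.
latent_rank <- as.integer(seq(10, 30, by = 2))
lambda1 <- c(0.1, 1, 10, 20, 50, 100)
lambda2 <- c(0.01, 0.1, 0.2, 0.4, 0.9)

# All combinations an exhaustive grid search would visit.
grid <- expand.grid(latent_rank = latent_rank, lambda1 = lambda1, lambda2 = lambda2)
nrow(grid)  # 11 * 6 * 5 combinations
```

If every combination is tried for tuning_iter iterations, the tuning cost grows with the product of the vector lengths, so coarse grids are a reasonable starting point.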

Model fitting

After parameter tuning, the tuning results are saved in the current directory. Choose the combination of hyperparameters with the lowest RMSE on the test set, and fit scParser with it. The parameters of the function fit are as follows:

object: an scParser object.
latent_rank: an integer for the rank of the latent space for modeling cellular variation.
cfd_rank (optional): the rank of the latent space for biological conditions. If not provided, latent_rank is used. It should be determined by parameter tuning in the previous step.
batch_num: the number of batches used when scParser is fitted with the batch strategy. It is considered only when is_batch is switched on. Its value is NULL by default.
is_batch: whether the batch strategy is on. By default, its value is FALSE.
lambda1: L2 penalty for latent representations of biological conditions.
lambda2: L2 penalty for latent representations of cells.
alpha: L1 penalty for latent representations of cells. Note: lambda2 and alpha together form the elastic net regularization for latent representations of cells.
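Picking the combination with the lowest test RMSE can be sketched as follows. The layout of the saved tuning results is not documented here, so the data frame `tuning_results`, its column names, and its numbers are made up purely for illustration.

```r
# Hypothetical tuning results with made-up numbers; the real file layout may differ.
tuning_results <- data.frame(
  latent_rank = c(10L, 12L, 12L, 14L),
  lambda1     = c(10, 20, 50, 20),
  lambda2     = c(0.1, 0.4, 0.9, 0.2),
  test_rmse   = c(0.912, 0.874, 0.881, 0.893)
)

# Pick the row with the smallest test RMSE.
best <- tuning_results[which.min(tuning_results$test_rmse), ]
best
```

The values in `best` would then be passed to fit as latent_rank, lambda1, and lambda2.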

# selected hyperparameters for scParser
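num_factors <- 11
lambda1 <- 40
lambda2 <- 0.7
alpha <- 1  # default value of alpha used in tuning

# argument names follow the parameter list above
object <- fit(object, latent_rank = as.integer(num_factors), lambda1 = lambda1, lambda2 = lambda2, alpha = alpha)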
save(object, file = paste0("scParser_DM_R", num_factors, "_fitted_object.RData"))

> str(object)
List of 9
 $ data           : num [1:22753, 1:2000] 0 0 7.22 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:22753] "SRR5818088-AAAAAAAAAAAA" "SRR5818088-AAAAAAAATAGG" "SRR5818088-AAAAGACGAACG" "SRR5818088-AAAAGGGCGAAC" ...
  .. ..$ : chr [1:2000] "ENSG00000108849" "ENSG00000157005" "ENSG00000118785" "ENSG00000164692" ...
 $ train_indicator: int [1:22753, 1:2000] 1 1 0 1 0 1 1 1 1 1 ...
 $ confounder     : num [1:22753, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:22753] "SRR5818088-AAAAAAAAAAAA" "SRR5818088-AAAAAAAATAGG" "SRR5818088-AAAAGACGAACG" "SRR5818088-AAAAGGGCGAAC" ...
  .. ..$ : chr [1:2] "disease_id" "donnor_id"
 $ params         :List of 4
  ..$ global_tol : num 1e-08
  ..$ sub_tol    : num 1e-05
  ..$ tuning_iter: num 30
  ..$ max_iter   : num 10000
 $ cfd_matrices   :List of 2
  ..$ factor0: num [1:2, 1:11] -1.3428 -0.9747 -0.736 -0.7037 0.0449 ...
  ..$ factor1: num [1:9, 1:11] 0.282 0.282 -0.188 -0.11 0.123 ...
 $ column_factor  : num [1:11, 1:2000] -0.1435 -0.15236 -0.13395 0.00664 0.12703 ...
 $ cell_factor    : num [1:22753, 1:11] 0 -4.897 -0.172 0 0.597 ...
 $ gene_factor    : num [1:11, 1:2000] 0.00835 -0.0087 0.01543 0.01838 0.0063 ...
 - attr(*, "class")= chr "scParser"

The fitted object obtained from the above command is an R list object, containing the following elements:

data: the log-transformed expression data matrix.
train_indicator: an indicator matrix marking the elements included in the training set.
confounder: the confounder matrix.
params: the parameter settings for scParser.
cfd_matrices: a list of low-rank representations for biological variables. One can access the low-rank representation for a specific biological variable with the index of the variable in the confounder matrix.
column_factor: a K * M gene latent representation matrix, where K is num_factors and M is the number of genes.
cell_factor: the sparse representation of cells.
gene_factor: a K * M gene latent representation matrix, where K is num_factors and M is the number of genes.
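The index-based lookup described for cfd_matrices can be illustrated with a small mock object. The sketch below does not call the package; `mock` only mimics the element names from the str() output with tiny, randomly filled matrices, and `per_cell_disease` is an illustrative name.

```r
set.seed(1)
k <- 3  # number of latent factors in this toy example

# Mock object with the same element names as the str() output, but tiny sizes.
mock <- list(
  confounder = cbind(disease_id = c(1L, 1L, 2L, 2L, 1L, 2L),
                     donnor_id  = c(1L, 2L, 3L, 1L, 2L, 3L)),
  cfd_matrices = list(
    factor0 = matrix(rnorm(2 * k), nrow = 2),  # one row per disease level
    factor1 = matrix(rnorm(3 * k), nrow = 3)   # one row per donor
  )
)

# Column 1 of the confounder matrix pairs with the first element of cfd_matrices.
disease_idx <- mock$confounder[, "disease_id"]
per_cell_disease <- mock$cfd_matrices[[1]][disease_idx, , drop = FALSE]
dim(per_cell_disease)  # one K-dimensional row per cell
```

The same pattern with the second confounder column and `cfd_matrices[[2]]` would give the donor representation for each cell.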

Modeling the effect of biological conditions on gene expression for cell populations

object <- scParser(as.matrix(dataset[,-c(1,2)]), confounders, split_ratio = 0.1, tuning_iter = 20)

confounders: A confounder matrix. The elements of the matrix are used as indices to extract the corresponding latent representations, so they must be integers greater than 0. To model the interaction between cell populations and biological conditions (e.g., disease status), the confounder matrix should contain a column in which each element encodes the combination of cell population and biological condition that the corresponding cell belongs to. For example, if there are 5 cell populations and 2 levels of a biological condition, each element of the column takes an integer value from 1 to 10.
The other parameters are the same as those introduced above for creating an scParser object, with the same definitions.
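One way to build such a combined column is plain integer arithmetic on the two index vectors. This is a sketch with hypothetical labels; any encoding that assigns a distinct positive integer to each combination would serve the same purpose.

```r
# Hypothetical per-cell labels: 5 cell populations, 2 condition levels.
cell_population <- c(1L, 3L, 5L, 2L, 5L, 4L)
condition       <- c(1L, 1L, 1L, 2L, 2L, 2L)
n_populations <- 5L

# Encode each (population, condition) pair as one integer in 1..10.
interaction_id <- (condition - 1L) * n_populations + cell_population
interaction_id  # 1 3 5 7 10 9
```

With this encoding, values 1 to 5 are the cell populations under the first condition level and values 6 to 10 are the same populations under the second.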

To tune the parameters of the new model, use partial_tune as follows.

object <- partial_tune(object, cfd_rank = as.integer(cfd_factor_num), lambda1 = c(0.1, 1, 10, 20, 30, 40, 50))

cfd_rank: An integer vector from which the rank of the latent dimension for modeling variation from biological conditions is chosen.
lambda1: A numeric vector from which the tuning parameter lambda1 is selected. $\lambda_1$ controls the penalty on the latent representations of biological variables.

Then, we can fit the new model with the parameters that performed best in the previous step.

object <- partial_fit(object, cfd_rank = as.integer(11), lambda1 = 1)

The fitted object obtained from the above command is an R list object as follows:

> str(object)
List of 6
 $ data         : num [1:22753, 1:2000] 0 0 7.22 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:22753] "SRR5818088-AAAAAAAAAAAA" "SRR5818088-AAAAAAAATAGG" "SRR5818088-AAAAGACGAACG" "SRR5818088-AAAAGGGCGAAC" ...
  .. ..$ : chr [1:2000] "ENSG00000108849" "ENSG00000157005" "ENSG00000118785" "ENSG00000164692" ...
 $ confounder   : num [1:22753, 1:2] 3 2 6 5 5 5 5 5 2 4 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:22753] "SRR5818088-AAAAAAAAAAAA" "SRR5818088-AAAAAAAATAGG" "SRR5818088-AAAAGACGAACG" "SRR5818088-AAAAGGGCGAAC" ...
  .. ..$ : chr [1:2] "interaction_id" "donnor_id"
 $ split_ratio  : num 0.1
 $ params       :List of 4
  ..$ global_tol : num 1e-08
  ..$ sub_tol    : num 1e-05
  ..$ tuning_iter: num 20
  ..$ max_iter   : num 50000
 $ cfd_matrices :List of 1
  ..$ factor0: num [1:12, 1:11] -0.537 -0.985 -0.39 0.74 -1.168 ...
  ..$ factor1: num [1:9, 1:11] 0.272 0.241 -0.158 -0.08 0.103 ...
 $ column_factor: num [1:11, 1:2000] 1.527 -0.611 -0.673 1.339 0.426 ...
 - attr(*, "class")= chr "scParser"

cfd_matrices: the first element of cfd_matrices holds the latent representations for the different combinations of cell population and biological condition.
The other elements have the same meaning as introduced previously.

For details of downstream analysis with results from scParser, please refer to the paper in References.

License

The software is released under the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

References

Zhao, K., So, H. C., & Lin, Z. (2024). scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis. Genome Biology, 25(1), 1-28.