This repository contains newly developed models for improving the integration of scRNA-seq datasets with substantial batch effects and reproducibility code for Hrovatin et al. (2023).
See also our talk on the M2D2 series.
Our preferred model, which combines the VampPrior with a latent cycle-consistency loss, is implemented in scvi-tools under the name sysVI (see below for details).
Computational methods for integrating scRNA-seq datasets often struggle to unify datasets with strong differences driven by technical or biological variation, such as between different species, organoids and primary tissue, or different scRNA-seq protocols, including single-cell and single-nuclei. Since many popular and scalable batch effect approaches are based on conditional variational autoencoders (cVAE), we hypothesize that machine learning interventions to standard cVAEs can help to improve batch effect removal while potentially better preserving biological variation. For this, we evaluate four strategies applied to commonly used cVAE models: the previously proposed Kullback–Leibler divergence (KL) regularization tuning and adversarial learning, as well as cycle-consistency loss (previously applied to multi-omic integration), and the multimodal variational mixture of posteriors prior (VampPrior) that has not yet been applied to integration. We evaluated performance in three settings, namely cross-species, organoid-tissue, and cell-nuclei integration. Cycle-consistency and VampPrior improved batch correction while retaining high biological preservation, with their combination further increasing the performance. While adversarial learning led to the strongest batch correction, its preservation of within-cell type variation did not match that of VampPrior or cycle-consistency models and it was also prone to mixing unrelated cell types with differences in proportions across batches. KL regularization strength tuning did not perform well, as it jointly removed biological and batch variation by reducing the number of effectively used embedding dimensions. Based on our results, we recommend the use of the VampPrior combined with cycle-consistency loss for integrating datasets with substantial batch effects.
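To make the VampPrior idea mentioned above more concrete, here is a minimal NumPy sketch of the log-density of an equally weighted mixture of diagonal Gaussians. In a VampPrior the component means and variances would be obtained by passing learnable pseudo-inputs through the encoder; in this sketch they are passed in directly, and the function names `log_diag_gaussian` and `vamp_prior_log_density` are hypothetical and not taken from this repository or from scvi-tools.

```python
import numpy as np


def log_diag_gaussian(z, mean, var):
    """Log-density of a diagonal Gaussian, summed over the last (latent) axis."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var, axis=-1)


def vamp_prior_log_density(z, comp_means, comp_vars):
    """log p(z) under an equally weighted mixture of K diagonal Gaussians.

    z:          (n_cells, n_latent) latent samples
    comp_means: (K, n_latent) component means
    comp_vars:  (K, n_latent) component variances

    Assumption: in a real VampPrior the component parameters are encoder
    outputs for K learnable pseudo-inputs; here they are plain arrays.
    """
    # Log-density of every cell under every component: shape (n_cells, K)
    log_comp = log_diag_gaussian(
        z[:, None, :], comp_means[None, :, :], comp_vars[None, :, :]
    )
    # Numerically stable log-mean-exp over the K components
    max_log = log_comp.max(axis=1, keepdims=True)
    log_mix = max_log + np.log(np.mean(np.exp(log_comp - max_log), axis=1, keepdims=True))
    return log_mix[:, 0]
```

With a single standard normal component this reduces to the usual Gaussian prior, which is a convenient sanity check when experimenting with the sketch.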
Figure 1: The challenge of integrating datasets with substantial batch effects. (a) Substantial batch effects are present between different biological “systems”, such as cross-species, organoid-tissue, and cell-nuclei datasets, which differ more substantially than datasets commonly used for integration, such as biologically similar samples generated by different laboratories. (b) Integrating datasets with substantial batch effects poses a bigger challenge than integrating datasets of similar samples across laboratories, where the batch effect is smaller. In this study, we evaluate different approaches for improving cVAE-based batch correction. (c) An example of substantial batch effects. Shown are Euclidean distance distributions in PCA space between mean embeddings of a cell type (delta cells) from samples within a dataset, samples between datasets within a system (mouse or human), and samples between different systems. (d) Overview of approaches for increasing batch removal strength in cVAE-based models: KL-loss-based regularization of the latent space, the use of the VampPrior as a replacement for the standard Gaussian prior, and adversarial learning and cycle-consistency loss that actively push together samples from different systems. Parts of the cVAE model were omitted from individual panels for brevity.
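The cycle-consistency idea from panel (d) can be illustrated with a toy example. The sketch below assumes one common formulation: a cell is encoded, decoded into the other system, re-encoded, and the two latent representations are compared with a mean squared error. The linear encoder and decoder, the two-system setup, and all names (`encode`, `decode`, `latent_cycle_loss`) are hypothetical simplifications for illustration, not the code used in this repository.

```python
import numpy as np


def encode(x, system, w_enc, b_sys):
    """Toy linear encoder conditioned on a per-cell system index (0 or 1)."""
    return x @ w_enc + b_sys[system]


def decode(z, system, w_dec, c_sys):
    """Toy linear decoder conditioned on a per-cell system index (0 or 1)."""
    return z @ w_dec + c_sys[system]


def latent_cycle_loss(x, system, w_enc, b_sys, w_dec, c_sys):
    """Mean squared distance between a cell's latent code and its cycle code.

    x:      (n_cells, n_genes) expression
    system: (n_cells,) integer array with values 0 or 1
    """
    other = 1 - system                             # switch every cell to the other system
    z = encode(x, system, w_enc, b_sys)            # latent code in the original system
    x_other = decode(z, other, w_dec, c_sys)       # expression generated in the other system
    z_cycle = encode(x_other, other, w_enc, b_sys) # re-encode the generated cell
    return float(np.mean((z - z_cycle) ** 2))
```

The loss is zero when re-encoding a cell translated into the other system recovers its original latent code, and it grows as the two codes drift apart.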
We suggest using the scvi-tools implementation of the sysVI model, as this repository will remain for reproducibility purposes only and will not be maintained in the future. The sysVI model is expected to be merged into scvi-tools version 1.2. Until then, please use this fork and the following tutorial. Note that some of the parameters were changed from the original implementation to adhere to the scvi-tools terminology.
Clone the git repository and, from within the cloned repository, run:

    pip install -e .
- Implementation of the model with VampPrior and cycle-consistency: cross_system_integration directory
- Model analysis and comparison: notebooks directory
- Environments: envs directory
For instructions on using the model, see the readme in the model directory (cross_system_integration).