scButterfly: a versatile single-cell cross-modality translation method via dual-aligned variational autoencoders

Installation

It's prefered to create a new environment for scButterfly

conda create -n scButterfly python==3.9
conda activate scButterfly

scButterfly is available on PyPI, and could be installed using

pip install scButterfly

Installation via Github is also provided

git clone https://github.com/Biox-NKU/scButterfly
cd scButterfly
pip install scButterfly-0.0.9-py3-none-any.whl

This process will take approximately 5 to 10 minutes, depending on the user's computer device and internet connectivition.

Quick Start

Illustrating with the translation between scRNA-seq and scATAC-seq data as an example, scButterfly could be easily used following 3 steps: data preprocessing, model training, predicting and evaluating. More details could be find in scButterfly documents.

Generate a scButterfly model first with following process:

from scButterfly.butterfly import Butterfly
butterfly = Butterfly()

1. Data preprocessing

Before data preprocessing, you should load the raw count matrix of scRNA-seq and scATAC-seq data via butterfly.load_data:

butterfly.load_data(RNA_data, ATAC_data, train_id, test_id, validation_id)

Parameters	Description
RNA_data	AnnData object of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to genes.
ATAC_data	AnnData object of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to peaks.
train_id	A list of cell IDs for training.
test_id	A list of cell IDs for testing.
validation_id	An optional list of cell IDs for validation, if setted None, butterfly will use a default setting of 20% cells in train_id.

Anndata object is a Python object/container designed to store single-cell data in Python packege anndata which is seamlessly integrated with scanpy, a widely-used Python library for single-cell data analysis.

For data preprocessing, you could use butterfly.data_preprocessing:

butterfly.data_preprocessing()

You could save processed data or output process logging to a file using following parameters.

Parameters	Description
save_data	optional, choose save the processed data or not, default False.
file_path	optional, the path for saving processed data, only used if `save_data` is True, default None.
logging_path	optional, the path for output process logging, if not save, set it None, default None.

scButterfly also support to refine this process using other parameters (more details on scButterfly documents), however, we strongly recommend the default settings to keep the best result for model.

2. Model training

Before model training, you could choose to use data augmentation strategy or not. If using data augmentation, scButterfly will generate synthetic samgles with the use of cell-type labels(if cell_type in adata.obs) or cluster labels get with Leiden algorithm and MultiVI, a single-cell multi-omics data joint analysis method in Python packages scvi-tools.

scButterfly provide data augmentation API:
```
butterfly.augmentation(aug_type)
```
You could choose parameter aug_type from cell_type_augmentation or MultiVI_augmentation, this will cause more training time used, but promise better result for predicting.
- If you choose cell_type_augmentation, scButterfly-T (Type) will try to find cell_type in adata.obs. If failed, it will automaticly transfer to MultiVI_augmentation.
- If you choose MultiVI_augmentation, scButterfly-C (Cluster) will train a MultiVI model first.
- If you just want to using original data for scButterfly-B (Basic) training, set aug_type = None.

You could construct a scButterfly model as following:

butterfly.construct_model(chrom_list)

scButterfly need a list of peaks count for each chromosome, remember to sort peaks with chromosomes.

Parameters	Description
chrom_list	a list of peaks count for each chromosome, remember to sort peaks with chromosomes.
logging_path	optional, the path for output model structure logging, if not save, set it None, default None.

scButterfly model could be easily trained as following:

butterfly.train_model()

Parameters	Description
output_path	optional, path for model check point, if None, using './model' as path, default None.
load_model	optional, the path for load pretrained model, if not load, set it None, default None.
logging_path	optional, the path for output training logging, if not save, set it None, default None.

scButterfly also support to refine the model structure and training process using other parameters for butterfly.construct_model() and butterfly.train_model() (more details on scButterfly documents).

3. Predicting and evaluating

scButterfly provide a predicting API, you could get predicted profiles as follow:

A2R_predict, R2A_predict = butterly.test_model()

A series of evaluating method also be integrated in this function, you could get these evaluation using parameters:

Parameters	Description
output_path	optional, path for model evaluating output, if None, using './model' as path, default None.
load_model	optional, the path for load pretrained model, if not load, set it None, default False.
model_path	optional, the path for pretrained model, only used if `load_model` is True, default None.
test_cluster	optional, test the correlation evaluation or not, including AMI, ARI, HOM, NMI, default False.
test_figure	optional, draw the tSNE visualization for prediction or not, default False.
output_data	optional, output the prediction to file or not, if True, output the prediction to `output_path/A2R_predict.h5ad` and `output_path/R2A_predict.h5ad`, default False.

Demo, document, tutorial and source code

We provide demos of basic scButterfly model and two variants (scButterfly-C and scButterfly-T) illustrating with CL datasets in scButterfly-B usage, scButterfly-C usage, and scButterfly-T usage, with data presented in Google drive. scButterfly-B, scButterfly-C and scButterfly-T repectively take about 12, 24, 18 minutes for the whole process (containing pre-processing, data augmentation, model training and evaluating) on desktop computer with NVIDIA RTX A6000 GPU.

We also provide richer tutorials and documents for scButterfly in scButterfly documents, including more details of provided APIs for customing data preprocessing, model structure and training strategy. The source code of experiments for scButterfly is available at source code, including more detailed source code for scButterfly.