A novel method for biomarker discovery that leverages deep neural network sensitivity analysis and an attention mechanism to select pivotal multi-omics biomarkers without relying on differential expression.
This repository contains the official implementation of the paper:
scSniper: Single-cell Deep Neural Network-based Identification of Prominent Biomarkers
[Mingyang Li], [Yanshuo Chen], [Jun Ding]
scSniper is an approach to biomarker discovery that uses deep neural network sensitivity analysis to pinpoint key gene biomarkers beyond what traditional differential expression methods find. It uses a mimetic attention mechanism to integrate multi-omic data and highlight critical biomarkers across genomics, proteomics, and metabolomics. This mechanism prioritizes important multi-omic features, enabling a comprehensive analysis that single-omic tools may miss.
The tool combines a disease classifier with an autoencoder to produce joint cell embeddings that represent disease-specific multi-omic profiles, improving single-cell resolution clustering and biomarker identification accuracy. Together, these components offer a more integrative and precise approach to understanding complex diseases.
scSniper requires Python 3.9 or higher. The following packages are required:
pip install -r requirements.txt
By default, torch is installed with CUDA support only on Linux. If you are using Windows or macOS, do not run the command above. Instead, first install the appropriate PyTorch package for your system from the PyTorch website, then run the following command:
pip install -r requirements_non_linux.txt
We assume your data are already preprocessed. If not, we recommend:
For RNA modality: use sc.pp.filter_genes, sc.pp.filter_cells, sc.pp.normalize_total, and sc.pp.log1p to preprocess your data. See the scanpy tutorial for more details.
For ADT modality: use CLR (centered log-ratio) normalization. You can use the muon package to do this. See the muon tutorial for more details.
For ATAC modality: filter peaks first, then binarize your data.
For all other modalities: we believe the preprocessing steps are similar to the above modalities. If you have any questions, please open an issue.
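The ADT and ATAC recommendations above can be sketched with plain NumPy. This is a minimal illustration, not the repository's code: the function names, the `min_cells` threshold, and the choice to apply a log1p-based centered log-ratio transform per cell are all assumptions made for this example.

```python
import numpy as np


def clr_normalize(counts: np.ndarray) -> np.ndarray:
    """Centered log-ratio transform applied to each cell (row).

    Assumption: the log1p-based variant, log1p(x / exp(mean(log1p(x)))),
    is used because it tolerates zero counts.
    """
    counts = np.asarray(counts, dtype=float)
    log_geo_mean = np.log1p(counts).mean(axis=1, keepdims=True)
    return np.log1p(counts / np.exp(log_geo_mean))


def filter_and_binarize_peaks(counts: np.ndarray, min_cells: int = 1):
    """Keep peaks detected in at least `min_cells` cells, then binarize.

    Returns the binarized matrix and the boolean mask of kept peaks.
    `min_cells` is a hypothetical threshold; tune it for your data.
    """
    counts = np.asarray(counts)
    keep = (counts > 0).sum(axis=0) >= min_cells
    binary = (counts[:, keep] > 0).astype(np.float32)
    return binary, keep


# Toy matrices: rows are cells, columns are proteins or peaks.
adt = np.array([[10, 0, 5], [3, 3, 3]])
atac = np.array([[0, 2, 0], [0, 1, 4]])

adt_clr = clr_normalize(adt)
atac_bin, kept = filter_and_binarize_peaks(atac, min_cells=1)
```

In practice the muon and scanpy functions are preferable because they operate on AnnData objects directly; the sketch only makes the arithmetic explicit.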
We also assume each non-RNA modality is stored in obsm of the AnnData object as a DataFrame whose column names are the feature names and whose row names are the cell names. The row order should match that of the RNA modality. If you have any questions, please open an issue.
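The row-order requirement can be checked and enforced before the table is placed in obsm. The helper below is a hypothetical sketch using pandas; `align_to_rna` and the toy cell and peak names are invented for illustration, and the obsm key `ATAC_data` simply mirrors the example key used in this README.

```python
import pandas as pd


def align_to_rna(modality_df: pd.DataFrame, rna_cell_names) -> pd.DataFrame:
    """Reorder the rows of a non-RNA table to match the RNA cell order."""
    rna_index = pd.Index(rna_cell_names)
    missing = rna_index.difference(modality_df.index)
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} RNA cells are missing from this modality")
    return modality_df.loc[rna_index]


# Toy example: the ATAC table lists the same cells in a different order.
rna_cells = ["cell_a", "cell_b", "cell_c"]
atac_df = pd.DataFrame(
    [[1, 0], [0, 1], [1, 1]],
    index=["cell_c", "cell_a", "cell_b"],
    columns=["peak_1", "peak_2"],
)
aligned = align_to_rna(atac_df, rna_cells)

# With an AnnData object `adata` whose obs_names equal rna_cells,
# the aligned table would then be stored as:
# adata.obsm["ATAC_data"] = aligned
```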
Required arguments:
- --data: Path to the data file. We assume it is an H5AD file.
- --modality_keys: The keys of the modalities in the data file. For example, if the data file contains RNA and ATAC data, the keys should be X for RNA and the obsm key for ATAC. Example: --modality_keys="{RNA:"X","ATAC":"ATAC_data"}".
- --class_label: The key of the disease class label in the data file. Example: --label_key=patient_cat.
- --output_path: Path to the output folder.
- --num_class: The number of disease classes. Example: --num_class=2.
- --encoder_dict: The output dimensions of the encoder layers. The last input dimension is the latent dimension of the modality. Input dimensions are automatically calculated. Example: --encoder_dict="{RNA:[128,128,64],ATAC:[128,64,32]}".
- --decoder_dict: The output dimensions of the decoder layers. The last input dimension is the latent dimension of the modality. Input dimensions are automatically calculated. Example: --decoder_dict="{RNA:[128,128],ATAC:[64,64,128]}".
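One way to read the --encoder_dict description is that each listed value is a layer's output size and each layer's input size is the previous layer's output, with the first input being the number of features in that modality. The sketch below illustrates that reading only; `layer_shapes` is a hypothetical helper, the feature count of 2000 is made up, and train.py may construct its layers differently.

```python
from typing import List, Tuple


def layer_shapes(n_input_features: int, output_dims: List[int]) -> List[Tuple[int, int]]:
    """Pair each layer's input size with its output size."""
    shapes = []
    in_dim = n_input_features
    for out_dim in output_dims:
        shapes.append((in_dim, out_dim))
        in_dim = out_dim  # the next layer consumes this layer's output
    return shapes


# RNA encoder from the example above, assuming 2000 input genes (hypothetical).
rna_encoder = layer_shapes(2000, [128, 128, 64])
print(rna_encoder)  # [(2000, 128), (128, 128), (128, 64)]
```

Under this reading, the final value (64 for RNA, 32 for ATAC in the example) is the size of the embedding that the encoder produces for that modality.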
Optionally, you can specify the following arguments:
- --batch_size: The batch size. Default: --batch_size=128.
- --learning_rate: The learning rate. Default: --lr=1e-3.
- --categorical_covariate: The categorical covariate, such as batch_label. Default: --categorical_covariate=None.
- --classifier_interlayers_dims: The dimensions of the classifier interlayers. Default: --classifier_interlayers_dims="{Classifier:[32,10]}".
Only the most important arguments are listed here. To see all arguments of train.py, run python train.py --help.
python train.py --your_arguments=your_values ...
The program automatically saves the result to the output folder as a pickle file named cell_type_loss_change.pkl. You can then run eval.py to find biomarkers, as illustrated below; run python eval.py --help to list its arguments.
Required arguments:
- --data: Path to the data file. We assume it is an H5AD file.
- --output: Path to the output folder in which to save the biomarkers.
- --modality_keys: The keys of the modalities in the data file. For example, if the data file contains RNA and ATAC data, the keys should be X for RNA and the obsm key for ATAC. Example: --modality_keys="{RNA:"X","ATAC":"ATAC_data"}".
- --result_file: The path to the pickle file generated by train.py. Example: --result_file=cell_type_loss_change.pkl.
- --num_features: The number of biomarkers to select per modality. Example: --num_features="{RNA:100,ATAC:100}".
python eval.py --your_arguments=your_values ...
Always use python eval.py --help to see all the arguments and their default values.
This creates one folder per cell_type in the output folder. Each folder contains the following files:
- modality_1: The biomarkers for modality 1. A text file where each line is a biomarker.
- ...
- modality_n: The biomarkers for modality n. A text file where each line is a biomarker.
- Joint: The cross-modality biomarkers. A text file where each line is a biomarker.
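Because the output is a folder per cell type holding one-biomarker-per-line text files, it can be collected into a nested dictionary with the standard library. The sketch below is illustrative: `load_biomarkers` is a hypothetical helper, it makes no assumption about the exact file names or extensions, and the demo tree (cell type `T_cell`, the gene and peak names) is invented so the block runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_biomarkers(output_dir: str) -> Dict[str, Dict[str, List[str]]]:
    """Map cell type -> file name (without extension) -> list of biomarkers."""
    results: Dict[str, Dict[str, List[str]]] = {}
    for cell_type_dir in sorted(Path(output_dir).iterdir()):
        if not cell_type_dir.is_dir():
            continue
        per_file: Dict[str, List[str]] = {}
        for path in sorted(cell_type_dir.iterdir()):
            if path.is_file():
                lines = path.read_text().splitlines()
                # One biomarker per line; skip blank lines.
                per_file[path.stem] = [ln.strip() for ln in lines if ln.strip()]
        results[cell_type_dir.name] = per_file
    return results


# Demo on a made-up output tree (all names here are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    cell_dir = Path(tmp) / "T_cell"
    cell_dir.mkdir()
    (cell_dir / "RNA").write_text("CD3E\nIL7R\n")
    (cell_dir / "Joint").write_text("CD3E\npeak_1\n\n")
    demo = load_biomarkers(tmp)
```

For a real run, pass the folder given to --output instead of the temporary directory.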
- Add distributed training (more than one GPU)
- Add option to use different optimizers
This work was funded in part by grants awarded to [JD]. We gratefully acknowledge support from the Canadian Institutes of Health Research (CIHR) under Grant No. PJT-180505; the Fonds de recherche du Québec - Santé (FRQS) under Grant Nos. 295298 and 295299; the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant No. RGPIN2022-04399; and the Meakins-Christie Chair in Respiratory Research.