Note 🚧 This repository is under construction. This note will disappear as soon as all the single-cell transformer paper tables are added.
This repository accompanies Transformers in Single-Cell Omics: A Review and New Perspectives. Please refer to the manuscript for the details.
We provide a curated list of single-cell transformers and their evaluation results. We skip models that work only on bulk data or slide images, and those in which transformers are used only as a part of the model. Models focusing on sequential data, such as DNA or protein sequences, are omitted too. New entries are added at the top of the corresponding table.
We welcome contributions to this repository. Please open a pull request or an issue if you want to add or edit an entry.
Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
scMulan | 📄(RECOMB)Bian et al. 2024 | 🔍Github | scRNA-Seq | 10M / cross-tissue, human (hECA) | Not specified | Decoder | Conditional cell generation | Cell type annotation, cell metadata annotation (both also used in training) | Batch integration |
BioFormers | 📝Belgadi and Li et al. 2023 | None | scRNA-Seq | 8K / single tissue, human (PBMC, Adamson et al. 2016) | Value categorization: value binning | Encoder | MLM with CE loss | None | Cell clustering, gene expression imputation, genetic perturbation effect prediction, GRN inference |
Geneformer | 📄(Nature)Theodoris et al. 2023 | 🛠🤗 | scRNA-Seq | 36M / cross-tissue, human (Genecorpus) | Ordering: rank-based | Encoder | MLM with CE loss, gene ID prediction | Gene function prediction, cell annotation | Cell clustering, GRN inference |
Universal Cell Embedding | 📝Rosen et al. 2023 | 🔍Github | scRNA-Seq | 36M / cross-tissue, cross-species (CELLxGENE and other) | Other: ESM-2 based gene embeddings. Gene embeddings are sampled according to expression levels and order determined by position on chromosomes. | Encoder | Modified MLM, binary CE loss predicting whether a gene is expressed or not. Uses CLS embedding instead of token-embeddings. | Cell annotation | Cell clustering, cross-species integration |
scGPT | 📄(Nature Meth)Cui et al. 2024 | 🔍GitHub | scRNA-Seq, scATAC-Seq, CITE-Seq, Spatial transcriptomics | 33M / cross-tissue, human, non-disease (CELLxGENE) | Value categorization: value binning | Other: attention masking in encoder | Iterative MLM variant with MSE loss, cell token expression prediction, gene expression prediction | Cell type annotation, genetic perturbation effect prediction, reverse perturbation prediction, cell clustering, multimodal embedding, gene function prediction | Cell clustering, GRN inference, simulation, gene expression imputation |
TOSICA | 📄(Nature Comms)Chen et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scMoFormer | 📄(ACM)Tang et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq, CITE-Seq | None | Other, SVD-based | Encoder and graph transformers | None | Cross-modality prediction | None |
tGPT | 📄(Cell iScience)Shen et al. 2023 | 🛠️GitHub | scRNA-Seq | 22M / cross-tissue, cross-species, disease and non-disease, organoids (list) | Ordering | Decoder | NTP with CE loss, gene ID prediction | None | Cell clustering, trajectory inference |
SpaFormer | 📝Wen et al. 2023 | 🛠️GitHub | Spatial transcriptomics | None | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss, gene expression prediction | Gene expression imputation | Cell clustering |
scFoundation | 📄(Nature Meth)Hao et al. 2024 and 📄(NeurIPS)Gong et al. 2023 | 🔍GitHub | scRNA-Seq | 50M / cross-tissue, human, disease and non-disease (GEO, Single Cell Portal, HCA, EMBL-EBI) | Value projection | Other: two encoders | Modified MLM with MSE loss, gene expression prediction | Drug response prediction, genetic perturbation effect prediction | Read depth enhancement, cell clustering |
CellLM | 📝Zhao et al. 2023 | 🔍GitHub | scRNA-Seq | 1.8M / cross-tissue, human, disease and non-disease (PanglaoDB, CancerSCEM) | Value categorization | Encoder | Contrastive loss, MLM with CE loss | Non-disease vs cancer prediction, cell type annotation, drug response prediction | None |
scCLIP | 📝Xiong et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq | 377k / cross-tissue, human fetal (ATAC, RNA) | Value projection | Encoder | Contrastive loss, CE matching modalities | None | Multimodal embedding |
GeneCompass | 📝Yang et al. 2023 | Partial 🛠 GitHub | scRNA-Seq | 126M / cross-tissue, human and mouse, disease and non-disease (GEO, SRA, CELLxGENE, GSA, Single Cell Portal, HCA, EMBL-EBI, 3CA, Cell BLAST, TEDD, and other) | ? | Other: two encoders | MLM with CE and MSE loss, gene ID and expression prediction | Cell type annotation, drug response prediction, gene function prediction | Cross-species integration, genetic perturbation effect prediction, GRN inference |
CellPLM | 📄(ICLR)Wen et al. 2024 | Partial 🔍GitHub | scRNA-Seq, Spatial transcriptomics | 11M / cross-tissue, human, disease and non-disease (HTCA, HCA, GEO) | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss and KL losses, gene expression prediction | Gene expression imputation, cell type annotation, genetic perturbation effect prediction | Cell clustering, scRNA-Seq denoising |
scMAE | 📝Kim et al. 2023 | None | single-cell flow cytometry | 6.5M / human, disease and non-disease (source?) | Other, concatenation of values with learnable protein embeddings | Other: two encoders | MLM with MSE loss, protein expression prediction | Cell type annotation, protein expression imputation | None |
CAN/CGRAN | 📄(Frontiers)Wang et al. 2023 | None | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scTranslator | 📝Liu et al. 2023 | 🔍️GitHub | scRNA-Seq, CITE-Seq | None | Value projection | Other: two encoders | None | Cross-modality prediction | (After cross-modality prediction training) GRN inference, cell clustering |
scTransSort | 📄(MDPI)Jiao et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
STGRNS | 📄(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Other | Encoder | None | GRN inference | None |
CIForm | 📄(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scFormer | 📝Cui et al. 2023 | Incomplete ️GitHub | scRNA-Seq | Task specific | Value categorization: value binning | Encoder | Modified MLM with CE, cell token expression prediction, contrastive loss with cosine similarity, gene expression prediction | Cell type annotation, genetic perturbation effect prediction | Cell clustering |
Exceiver | 📝Connell et al. 2022 | 🛠️GitHub | scRNA-Seq | 0.5M / cross-tissue, human (Tabula Sapiens) | Other: value scaled embeddings | Encoder | Modified MLM with MSE, gene expression prediction | Cell type annotation, drug response prediction | Cell clustering |
TransCluster | 📄(Frontiers)Song et al. 2022 | 🛠️GitHub | scRNA-Seq | None | Value projection with LDA | Encoder | None | Cell type annotation | None |
scBERT | 📄(Nature MI)Yang et al. 2022 | 🔍GitHub | scRNA-Seq | 1M / cross-tissue, human (PanglaoDB) | Value categorization, binning | Encoder | MLM with CE loss, gene expression prediction | Cell type annotation, unseen cell type detection | None |
iSEEEK | 📄(OUP)Shen et al. 2022 | 🔍Github (dataset not public) | scRNA-Seq | 11.9M / cross-tissue, cross-species (list) | Ordering: rank-based | Encoder | MLM with CE loss | Marker gene classification | Cell clustering, pseudotime analysis, GRN inference |
Multitask learning | 📝Pang et al. 2020 | None | scRNA-Seq | 160k / brain, mouse (MBA) | Value projection | Other: autoencoder with two transformer encoders (?) | Modified MLM with MSE loss, gene expression prediction | None | Cell clustering |
Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
scInterpreter | 📝Li et al. 2024 | None | scRNA-Seq | Natural Language GPT-3.5 and Llama-13b | Other: Ordering with embedding of the natural language representation | Decoder, GPT-3.5 and Llama-13b | NTP with CE loss and instruction finetuning (GPT-3.5 closed-source) | None | Cell type annotation (LLMs frozen, only small MLP trained) |
ChatCell | 📝❌Fang et al. 2024 | 🛠GitHub | scRNA-Seq | Natural Language T5 and natural language instructions | Other: Ordering with embedding as natural language with additional terms | Encoder-Decoder, T5 | NTP with CE loss | None (conditional sequence generation, prompting) | Simulation, cell type annotation, drug sensitivity prediction |
MarkerGeneBERT | 📝Cheng et al. 2023 | None | scRNA-Seq | Natural Language, PubMed and PubMed Central | Other: Natural language preprocessed with SciBERT | Encoder | MLM | Named Entity Recognition (NER), cell-biomarker sentence classification | None |
scELMo | 📝Liu, Chen and Zheng 2023 | Partial 🔍GitHub | scRNA-Seq, CITE-Seq | Natural Language, Closed source | Other: NLP model embeddings of features weighted by the feature level in a cell (e.g. expression level) | Closed source (some open) | Closed source (some open) | Cell type annotation, genetic perturbation effect prediction | Cell and gene embeddings in other perturbation models |
GenePT | 📝Chen and Zou 2023 | Partial 🔍GitHub | scRNA-Seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | Gene function prediction | Cell clustering, GRN inference |
GPT-4 | 📄(Nature Meth)W. Hou and Z. Ji 2024 | 🔍GitHub | scRNA-Seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | None (conditional sequence generation, prompting) | Cell type annotation |
Cell2Sentence | 📄(ICML)Levine et al. 2024 | 🛠️GitHub | scRNA-Seq | Natural Language (GPT2) and scRNA-Seq (40k / immune, human) | Ordering: embedding as natural language | Decoder | NTP with CE loss | None | Simulation, cell type annotation |
Paper | Code | Omic Modalities | Evaluated Transformers | Tasks | Notes |
---|---|---|---|---|---|
📝He et al. 2024 | 🛠️GitHub | scRNA-Seq | scGPT | Cell type annotation | Evaluation of Parameter-Efficient Fine-Tuning (PEFT) for scGPT. Indicates that PEFT is not only more compute-efficient but also results in better cell type prediction. |
📄(Nature MI)Khan et al. 2023 | 🛠️GitHub | scRNA-Seq | scBERT | Cell type annotation, unseen cell type detection | Focused on imbalanced cell type classification. scBERT is sensitive to class imbalance. scBERT outperforms Seurat. scBERT doesn't perform well in unseen cell type detection. It benefits from SSL pretraining. |
📝Liu et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq, Spatial transcriptomics | scGPT, Geneformer, scBERT, tGPT, CellLM | Cell clustering, cell type annotation, multimodal embedding, GRN inference, gene expression imputation, genetic perturbation effect prediction, simulation, gene function prediction | Models aren't trained on the same datasets. scGPT is positioned as the most versatile in terms of the diversity of tasks it can tackle. Non-transformer models appear to be at least as good as transformers in most tasks. Transformers were shown to be sensitive to the choice of hyperparameters, such as learning rate and number of epochs. |
📝Boiarsky et al. 2023 | 🛠️GitHub | scRNA-Seq | scBERT, scGPT | Cell type annotation | Logistic regression appears to be as good as transformers in cell type annotation, even in low-data scenarios. |
📝Kedzierska et al. 2023 | 🛠️GitHub | scRNA-Seq | scGPT, Geneformer | Cell clustering | Zero-shot performance only. Both models appear unreliable. |
📝Alsabbagh et al. 2023 | 🛠️GitHub | scRNA-Seq | scGPT, Geneformer, scBERT | Cell type annotation | Focused on imbalanced cell type classification. Geneformer appears to be outperformed by scGPT and scBERT, which perform similarly to each other. |
- 📝 - Preprint
- 📄 - Peer-Reviewed Publication
- 🛠️ - Fully reproducible
- 🔍 - Code for evaluation only
- ❌ - Retracted or withdrawn
If you find the data in this repository useful for your work, please cite:
@Article{TBA}