This repository accompanies *Transformers in Single-Cell Omics: A Review and New Perspectives*. Please refer to the manuscript for details.

We provide a curated list of single-cell transformers and their evaluation results. We omit models that operate only on bulk data or slide images, models in which a transformer is only one component, and models focused on sequential data such as DNA or protein sequences. New entries are added at the top of the corresponding table.
We welcome contributions to this repository. Please open a pull request or an issue if you want to add or edit an entry.

Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
Precious3GPT | 📝Galkin et al. 2024 | Partial 🔍️🤗 | Bulk/scRNA-seq, DNAm, proteomics, natural language annotations | Omics data with KG and text embeddings, Closed source | ? | Decoder-only LLaMA-like transformer model with modality mapper units | Emulation of chemical response, cross-species/tissue/omics transference, emulation of clinical conditions | Age prediction, gene classification | DEG prediction, Indication discovery |
LangCell | 📄(ICML)Zhao et al. 2024 | 🛠️Github | scRNA-seq, natural language | 27M / cross-tissue, human (CELLxGENE) | Ordering: rank-based, natural language cell description | Other: two encoders (cell and text) | MLM with CE loss, intra- and inter-modal contrastive loss, cell-text matching with CE loss | Cell type annotation, pathway identification | Novel cell type identification, NSCLC subtype classification, batch integration, cell clustering |
ScRAT | 📄(Bioinformatics)Mao et al. 2024 | 🛠️GitHub | scRNA-seq | None | Cells as tokens | Encoder | None | Phenotype prediction: aggregated per sample cell embeddings are used to predict sample label (e.g., health condition) | None |
scPRINT | 📝Kalfon et al. 2024 | 🛠️Github | scRNA-seq | 50M / cross-tissue, cross-species (CELLxGENE) | Other: ESM-2 based gene embeddings. Gene embeddings are randomly sampled and order determined by position on chromosomes | Encoder | Multi-task pre-training: denoising, bottleneck learning (+ many additional losses available) | Cell label prediction (these supervised tasks are part of the pre-training) | Read depth enhancement, gene expression imputation, batch integration, cell clustering, cell label prediction, GRN inference |
scMulan | 📄(RECOMB)Bian et al. 2024 | 🔍Github | scRNA-seq | 10M / cross-tissue, human (hECA) | Not specified | Decoder | Conditional cell generation | cell type annotation, cell metadata annotation (both also used in training) | Batch integration |
BioFormers | 📝Belgadi and Li et al. 2023 | None | scRNA-seq | 8K / single tissue, human (PBMC, Adamson et al. 2016) | Value categorization: value binning | Encoder | MLM with CE loss | None | Cell clustering, gene expression imputation, genetic perturbation effect prediction, GRN inference |
Geneformer | 📄(Nature)Theodoris et al. 2023 | 🛠🤗 | scRNA-seq | 36M / cross-tissue, human (Genecorpus) | Ordering: rank-based | Encoder | MLM with CE loss, gene ID prediction | Gene function prediction, cell annotation | Cell clustering, GRN inference |
Universal Cell Embedding | 📝Rosen et al. 2023 | 🔍Github | scRNA-seq | 36M / cross-tissue, cross-species (CELLxGENE and other) | Other: ESM-2 based gene embeddings. Gene embeddings are sampled according to expression levels and order determined by position on chromosomes. | Encoder | Modified MLM, binary CE loss predicting whether a gene is expressed or not. Uses CLS embedding instead of token-embeddings. | Cell annotation | Cell clustering, cross-species integration |
scGPT | 📄(Nature Meth)Cui et al. 2024 | 🔍GitHub | scRNA-seq, scATAC-seq, CITE-seq, Spatial transcriptomics | 33M / cross-tissue, human, non-disease (CELLxGENE) | Value categorization: value binning | Other: attention masking in encoder | Iterative MLM variant with MSE loss, cell token expression prediction, gene expression prediction | Cell type annotation, genetic perturbation effect prediction, reverse perturbation prediction, cell clustering, multimodal embedding, gene function prediction | Cell clustering, GRN inference, simulation, gene expression imputation |
TOSICA | 📄(Nature Comms)Chen et al. 2023 | 🛠️GitHub | scRNA-seq | None | Value projection | Encoder | None | Cell type annotation | None |
scMoFormer | 📄(ACM)Tang et al. 2023 | 🛠️GitHub | scRNA-seq, scATAC-seq, CITE-seq | None | Other, SVD-based | Encoder and graph transformers | None | Cross-modality prediction | None |
tGPT | 📄(Cell iScience)Shen et al. 2023 | 🛠️GitHub | scRNA-seq | 22M / cross-tissue, cross-species, disease and non-disease, organoids (list) | Ordering | Decoder | NTP with CE loss, gene ID prediction | None | Cell clustering, trajectory inference |
SpaFormer | 📝Wen et al. 2023 | 🛠️GitHub | Spatial transcriptomics | None | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss, gene expression prediction | Gene expression imputation | Cell clustering |
scFoundation | 📄(Nature Meth)Hao et al. 2024 and 📄(NeurIPS)Gong et al. 2023 | 🔍GitHub | scRNA-seq | 50M / cross-tissue, human, disease and non-disease (GEO, Single Cell Portal, HCA, EMBL-EBI) | Value projection | Other: two encoders | Modified MLM with MSE loss, gene expression prediction | Drug response prediction, genetic perturbation effect prediction | Read depth enhancement, cell clustering |
CellLM | 📝Zhao et al. 2023 | 🔍GitHub | scRNA-seq | 1.8M / cross-tissue, human, disease and non-disease (PanglaoDB, CancerSCEM) | Value categorization | Encoder | Contrastive loss, MLM with CE loss | Non-disease vs cancer prediction, cell type annotation, drug response prediction | None |
scCLIP | 📝Xiong et al. 2023 | 🛠️GitHub | scRNA-seq, scATAC-seq | 377k / cross-tissue, human fetal (ATAC, RNA) | Value projection | Encoder | Contrastive loss, CE matching modalities | None | Multimodal embedding |
GeneCompass | 📝Yang et al. 2023 | Partial 🛠 GitHub | scRNA-seq | 126M / cross-tissue, human and mouse, disease and non-disease (GEO, SRA, CELLxGENE, GSA, Single Cell Portal, HCA, EMBL-EBI, 3CA, Cell BLAST, TEDD, and other) | ? | Other: two encoders | MLM with CE and MSE loss, gene ID and expression prediction | Cell type annotation, drug response prediction, gene function prediction | Cross-species integration, genetic perturbation effect prediction, GRN inference |
CellPLM | 📄(ICLR)Wen et al. 2024 | Partial 🔍GitHub | scRNA-seq, Spatial transcriptomics | 11M / cross-tissue, human, disease and non-disease (HTCA, HCA, GEO) | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss and KL losses, gene expression prediction | Gene expression imputation, cell type annotation, genetic perturbation effect prediction | Cell clustering, scRNA-seq denoising |
scMAE | 📝Kim et al. 2023 | None | single-cell flow cytometry | 6.5M / human, disease and non-disease (source?) | Other, concatenation of values with learnable protein embeddings | Other: two encoders | MLM with MSE loss, protein expression prediction | Cell type annotation, protein expression imputation | None |
CAN/CGRAN | 📄(Frontiers)Wang et al. 2023 | None | scRNA-seq | None | Value projection | Encoder | None | Cell type annotation | None |
scTranslator | 📝Liu et al. 2023 | 🔍️GitHub | scRNA-seq, CITE-seq | None | Value projection | Other: two encoders | None | Cross-modality prediction | (After cross-modality prediction training) GRN inference, cell clustering |
scTransSort | 📄(MDPI)Jiao et al. 2023 | 🛠️GitHub | scRNA-seq | None | Value projection | Encoder | None | Cell type annotation | None |
STGRNS | 📄(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-seq | None | Other | Encoder | None | GRN inference | None |
CIForm | 📄(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-seq | None | Value projection | Encoder | None | Cell type annotation | None |
scFormer | 📝Cui et al. 2023 | Incomplete GitHub | scRNA-seq | Task specific | Value categorization: value binning | Encoder | Modified MLM with CE loss, cell token expression prediction, contrastive loss with cosine similarity, gene expression prediction | Cell type annotation, genetic perturbation effect prediction | Cell clustering |
Exceiver | 📝Connell et al. 2022 | 🛠️GitHub | scRNA-seq | 0.5M / cross-tissue, human (Tabula Sapiens) | Other: value scaled embeddings | Encoder | Modified MLM with MSE, gene expression prediction | Cell type annotation, drug response prediction | Cell clustering |
TransCluster | 📄(Frontiers)Song et al. 2022 | 🛠️GitHub | scRNA-seq | None | Value projection with LDA | Encoder | None | Cell type annotation | None |
scBERT | 📄(Nature MI)Yang et al. 2022 | 🔍GitHub | scRNA-seq | 1M / cross-tissue, human (PanglaoDB) | Value categorization, binning | Encoder | MLM with CE loss, gene expression prediction | Cell type annotation, unseen cell type detection | None |
iSEEEK | 📄(OUP)Shen et al. 2022 | 🔍Github (dataset not public) | scRNA-seq | 11.9M / cross-tissue, cross-species (list) | Ordering: rank-based | Encoder | MLM with CE loss | Marker gene classification | Cell clustering, pseudotime analysis, GRN inference |
Multitask learning | 📝Pang et al. 2020 | None | scRNA-seq | 160k / brain, mouse (MBA) | Value projection | Other: autoencoder with two transformer encoders (?) | Modified MLM with MSE loss, gene expression prediction | None | Cell clustering |
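
Two input-encoding schemes recur in the "Input Embedding" column above: rank-based ordering of genes (e.g., Geneformer, iSEEEK) and value binning of expression levels (e.g., scGPT, scBERT, BioFormers). The sketch below illustrates both ideas on a toy count vector; the function names, quantile-binning details and toy data are illustrative assumptions, not code from any of the listed repositories.

```python
import numpy as np

def rank_encode(counts: np.ndarray, gene_ids: np.ndarray, max_len: int = 2048) -> np.ndarray:
    """Rank-based ordering (Geneformer-style): the input 'sentence' is the list of
    expressed gene IDs sorted by decreasing expression; the values themselves are dropped."""
    expressed = counts > 0
    order = np.argsort(-counts[expressed])        # most highly expressed genes first
    return gene_ids[expressed][order][:max_len]   # truncate to the model's context length

def bin_encode(counts: np.ndarray, n_bins: int = 51) -> np.ndarray:
    """Value binning (scGPT-style): continuous expression values are discretized into a
    fixed number of categories, so each gene becomes a (gene ID, value bin) token pair."""
    nonzero = counts[counts > 0]
    if nonzero.size == 0:
        return np.zeros(counts.shape, dtype=int)
    edges = np.quantile(nonzero, np.linspace(0, 1, n_bins))  # per-cell quantile bin edges
    bins = np.digitize(counts, edges)
    bins[counts == 0] = 0                          # zero expression stays in bin 0
    return bins

# toy cell with five genes
counts = np.array([0.0, 7.0, 1.0, 0.0, 3.0])
gene_ids = np.array([101, 102, 103, 104, 105])
print(rank_encode(counts, gene_ids))  # [102 105 103]
print(bin_encode(counts))             # bin index per gene; 0 for unexpressed genes
```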

The table below lists transformers that couple single-cell data with natural language, typically by building on pre-trained language models.

Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
CELLama | 📝Choi et al. 2024 | 🛠GitHub | scRNA-seq, Spatial transcriptomics | Natural Language SBERT | Other: Ordering with embedding of the natural language representation, additional cell annotations are added in natural language | Siamese encoders (SBERT) | Contrastive loss | Cell type annotation | Cell type annotation, niche cell type featuring |
CellWhisperer | 📝Schaefer et al. 2024 | Soon | Bulk/scRNA-seq | Transcriptome data paired with natural language annotations | Geneformer- and BioBERT-based embedding models (contrastively fine-tuned) | Multimodal contrastive training of embedding models (CLIP) and transcriptome instruction fine-tuning of LLM (LLaVA) | None | Transcriptome-aware question-answering | Reference-free cell property prediction (cell types & states, disease states, organ of cell origin, ...) |
scInterpreter | 📝Li et al. 2024 | None | scRNA-seq | Natural Language GPT-3.5 and Llama-13b | Other: Ordering with embedding of the natural language representation | Decoder, GPT-3.5 and Llama-13b | NTP with CE loss and instruction finetuning (GPT-3.5 closed-source) | None | Cell type annotation (LLMs frozen, only small MLP trained) |
ChatCell | 📝❌Fang et al. 2024 | 🛠GitHub | scRNA-seq | Natural Language T5 and natural language instructions | Other: Ordering with embedding as natural language with additional terms | Encoder-Decoder, T5 | NTP with CE loss | None (conditional sequence generation, prompting) | Simulation, cell type annotation, drug sensitivity prediction |
MarkerGeneBERT | 📝Cheng et al. 2023 | None | scRNA-seq | Natural Language, PubMed and PubMed Central | Other: Natural language preprocessed with SciBERT | Encoder | MLM | Named Entity Recognition (NER), cell-biomarker sentence classification | None |
scELMo | 📝Liu, Chen and Zheng 2023 | Partial 🔍GitHub | scRNA-seq, CITE-seq | Natural Language, Closed source | Other: NLP model embeddings of features weighted by the feature level in a cell (e.g. expression level) | Closed source (some open) | Closed source (some open) | Cell type annotation, Genetic perturbation effect prediction | Cell and gene embeddings in other perturbation models |
GenePT | 📝Chen and Zou 2023 | Partial 🔍GitHub | scRNA-seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | Gene function prediction | Cell clustering, GRN inference |
GPT-4 | 📄(Nature Meth)W. Hou and Z. Ji 2024 | 🔍GitHub | scRNA-seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | None (conditional sequence generation, prompting) | Cell type annotation |
Cell2Sentence | 📄(ICML)Levine et al. 2024 | 🛠️GitHub | scRNA-seq | Natural Language (GPT2) and scRNA-seq (40k / immune, human) | Ordering: embedding as natural language | Decoder | NTP with CE loss | None | Simulation, cell type annotation |
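
Several models in this table share the idea of representing a cell as natural language by listing gene symbols in order of decreasing expression (the "embedding as natural language" entries for GenePT, GPT-4 and Cell2Sentence). Below is a minimal sketch of that idea; the gene symbols, prompt wording and `top_k` cutoff are illustrative assumptions rather than the exact procedure of any specific model.

```python
import numpy as np

def cell_to_sentence(counts: np.ndarray, gene_symbols: list, top_k: int = 100) -> str:
    """Turn one cell into a 'cell sentence': the symbols of the top_k most highly
    expressed genes, ordered by decreasing expression and joined into plain text."""
    order = np.argsort(-counts)
    expressed = [gene_symbols[i] for i in order[:top_k] if counts[i] > 0]
    return " ".join(expressed)

# toy example; the gene symbols and prompt wording are illustrative only
counts = np.array([0.0, 12.0, 3.0, 0.0, 8.0])
genes = ["ACTB", "CD3D", "MS4A1", "NKG7", "CD8A"]
sentence = cell_to_sentence(counts, genes)        # "CD3D CD8A MS4A1"
prompt = (
    "The following genes are ordered from highest to lowest expression in one cell: "
    f"{sentence}. What is the most likely cell type?"
)
print(prompt)
```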

Papers that systematically evaluate single-cell transformers are listed below.

Paper | Code | Omic Modalities | Evaluated Transformers | Tasks | Notes |
---|---|---|---|---|---|
📝He et al. 2024 | 🛠️GitHub | scRNA-seq | scGPT | Cell type annotation | Evaluation of Parameter-Efficient Fine-Tuning (PEFT) for scGPT. Indicates that PEFT not only is more compute-efficient, but also results in better cell type prediction. |
📄(Nature MI)Khan et al. 2023 | 🛠️GitHub | scRNA-seq | scBERT | Cell type annotation, unseen cell type detection | Focused on imbalanced cell type classification. scBERT is sensitive to class imbalance, outperforms Seurat, and benefits from SSL pre-training, but does not perform well at unseen cell type detection. |
📝Liu et al. 2023 | 🛠️GitHub | scRNA-seq, scATAC-seq, Spatial transcriptomics | scGPT, Geneformer, scBERT, tGPT, CellLM | Cell clustering, cell type annotation, multimodal embedding, GRN inference, gene expression imputation, genetic perturbation effect prediction, simulation, gene function prediction | Models aren't trained on the same datasets. scGPT is positioned as the most versatile in terms of the diversity of tasks it can tackle. Non-transformer models appear to be at least as good as transformers in most tasks. Transformers were shown to be sensitive to the choice of hyperparameters, such as learning rate and number of epochs. |
📝Boiarsky et al. 2023 | 🛠️GitHub | scRNA-seq | scBERT, scGPT | Cell type annotation | Logistic regression appears to be as good as transformers in cell type annotation, even in low-data scenarios. |
📝Kedzierska et al. 2023 | 🛠️GitHub | scRNA-seq | scGPT, Geneformer | Cell clustering | Zero-shot performance only. Both models appear unreliable. |
📝Alsabbagh et al. 2023 | 🛠️GitHub | scRNA-seq | scGPT, Geneformer, scBERT | Cell type annotation | Focused on imbalanced cell type classification. Geneformer appears to be outperformed by scGPT and scBERT, which perform similarly to each other. |
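
Several of these evaluations score zero-shot cell clustering by clustering frozen model embeddings and comparing them with author-provided cell type labels (e.g., via ARI/NMI). A minimal sketch of such a protocol is shown below, assuming the embeddings have already been extracted into `adata.obsm["X_model"]` and that `scanpy` and `scikit-learn` are available; the key names are placeholders.

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def zero_shot_clustering_scores(adata, emb_key="X_model", label_key="cell_type"):
    """Cluster frozen transformer embeddings and compare them to annotated cell types.
    Assumes the embeddings were precomputed and stored in adata.obsm[emb_key]."""
    sc.pp.neighbors(adata, use_rep=emb_key)       # kNN graph built in the embedding space
    sc.tl.leiden(adata, key_added="leiden_emb")   # graph-based clustering of the embeddings
    clusters = adata.obs["leiden_emb"]
    labels = adata.obs[label_key]
    return {
        "ARI": adjusted_rand_score(labels, clusters),
        "NMI": normalized_mutual_info_score(labels, clusters),
    }
```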

Legend:

- 📝 - Preprint
- 📄 - Peer-Reviewed Publication
- 🛠️ - Fully reproducible
- 🔍 - Code for evaluation only
- ❌ - Retracted or withdrawn
If you find the data in this repository useful for your work, please cite:
```
@article{szalata_transformers_2024,
  title        = {Transformers in single-cell omics: a review and new perspectives},
  author       = {Szałata, Artur and Hrovatin, Karin and Becker, Sören and Tejada-Lapuerta, Alejandro and Cui, Haotian and Wang, Bo and Theis, Fabian J.},
  journaltitle = {Nature Methods},
  shortjournal = {Nature Methods},
  date         = {2024-08-01},
  volume       = {21},
  number       = {8},
  pages        = {1430--1443},
  issn         = {1548-7105},
  doi          = {10.1038/s41592-024-02353-z},
  url          = {https://doi.org/10.1038/s41592-024-02353-z},
  abstract     = {Recent efforts to construct reference maps of cellular phenotypes have expanded the volume and diversity of single-cell omics data, providing an unprecedented resource for studying cell properties. Despite the availability of rich datasets and their continued growth, current single-cell models are unable to fully capitalize on the information they contain. Transformers have become the architecture of choice for foundation models in other domains owing to their ability to generalize to heterogeneous, large-scale datasets. Thus, the question arises of whether transformers could set off a similar shift in the field of single-cell modeling. Here we first describe the transformer architecture and its single-cell adaptations and then present a comprehensive review of the existing applications of transformers in single-cell analysis and critically discuss their future potential for single-cell biology. By studying limitations and technical challenges, we aim to provide a structured outlook for future research directions at the intersection of machine learning and single-cell biology.},
}
```