Transformers in Single-Cell Omics

This repository accompanies Transformers in Single-Cell Omics: A Review and New Perspectives. Please refer to the manuscript for details.

We provide a curated list of single-cell transformers and their evaluation results. We exclude models that work only on bulk data or slide images, as well as models in which transformers are used only as a part of the model. Models focusing on sequential data, such as DNA or protein sequences, are also omitted. New entries are added at the top of the corresponding table.

We welcome contributions to this repository. Please open a pull request or an issue if you want to add or edit an entry.

Single-cell transformers

Model Paper Code Omic Modalities Pre-training Dataset Input Embedding Architecture SSL Tasks Supervised Tasks Zero-shot Tasks
Precious3GPT 📝Galkin et al. 2024 Partial 🔍️🤗 Bulk/scRNA-seq, DNAm, proteomics, natural language annotations Omics data with KG and text embeddings, Closed source ? Decoder-only LLaMA-like transformer model with modality mapper units Emulation of chemical response, cross-species/tissue/omics transference, emulation of clinical conditions Age prediction, gene classification DEG prediction, Indication discovery
LangCell 📄(ICML)Zhao et al. 2024 🛠️Github scRNA-seq, natural language 27M / cross-tissue, human (CELLxGENE) Ordering: rank-based, natural language cell description Other: two encoders (cell and text) MLM with CE loss, intra- and inter-modal contrastive loss, cell-text matching with CE loss Cell type annotation, pathway identification Novel cell type identification, NSCLC subtype classification, batch integration, cell clustering
ScRAT 📄(Bioinformatics)Mao et al. 2024 🛠️GitHub scRNA-seq None Cells as tokens Encoder None Phenotype prediction: aggregated per sample cell embeddings are used to predict sample label (e.g., health condition) None
scPRINT 📝Kalfon et al. 2024 🛠️Github scRNA-seq 50M / cross-tissue, cross-species (CELLxGENE) Other: ESM-2 based gene embeddings. Gene embeddings are randomly sampled and order determined by position on chromosomes Encoder Multi-task pre-training: Denoising, Bottleneck learning (+ many additional losses available) Cell label prediction (these supervised tasks are part of the pre-training) Read depth enhancement, gene expression imputation, Batch Integration, Cell Clustering, Cell Label Prediction, GRN inference
scMulan 📄(RECOMB)Bian et al. 2024 🔍Github scRNA-seq 10M / cross-tissue, human (hECA) Not specified Decoder Conditional cell generation cell type annotation, cell metadata annotation (both also used in training) Batch integration
BioFormers 📝Belgadi and Li et al. 2023 None scRNA-seq 8K / single tissue, human (PBMC, Adamson et al. 2016) Value categorization: value binning Encoder MLM with CE loss None Cell clustering, gene expression imputation, genetic perturbation effect prediction, GRN inference
Geneformer 📄(Nature)Theodoris et al. 2023 🛠🤗 scRNA-seq 36M / cross-tissue, human (Genecorpus) Ordering: rank-based Encoder MLM with CE loss, gene ID prediction Gene function prediction, cell annotation Cell clustering, GRN inference
Universal Cell Embedding 📝Rosen et al. 2023 🔍Github scRNA-seq 36M / cross-tissue, cross-species (CELLxGENE and other) Other: ESM-2 based gene embeddings. Gene embeddings are sampled according to expression levels and order determined by position on chromosomes. Encoder Modified MLM, binary CE loss predicting whether a gene is expressed or not. Uses CLS embedding instead of token-embeddings. Cell annotation Cell clustering, cross-species integration
scGPT 📄(Nature Meth)Cui et al. 2024 🔍GitHub scRNA-seq, scATAC-seq, CITE-seq, Spatial transcriptomics 33M / cross-tissue, human, non-disease (CELLxGENE) Value categorization: value binning Other: attention masking in encoder Iterative MLM variant with MSE loss, cell token expression prediction, gene expression prediction Cell type annotation, genetic perturbation effect prediction, reverse perturbation prediction, cell clustering, multimodal embedding, gene function prediction Cell clustering, GRN inference, simulation, gene expression imputation
TOSICA 📄(Nature Comms)Chen et al. 2023 🛠️GitHub scRNA-seq None Value projection Encoder None Cell type annotation None
scMoFormer 📄(ACM)Tang et al. 2023 🛠️GitHub scRNA-seq, scATAC-seq, CITE-seq None Other, SVD-based Encoder and graph transformers None Cross-modality prediction None
tGPT 📄(Cell iScience)Shen et al. 2023 🛠GitHub️ scRNA-seq 22M / cross-tissue, cross-species, disease and non-disease, organoids (list) Ordering Decoder NTP with CE loss, gene ID prediction None Cell clustering, trajectory inference
SpaFormer 📝Wen et al. 2023 🛠️GitHub Spatial transcriptomics None Cells as tokens, value projection Encoder Modified MLM with MSE loss, gene expression prediction Gene expression imputation Cell clustering
scFoundation 📄(Nature Meth)Hao et al. 2024 and 📄(NeurIPS)Gong et al. 2023 🔍GitHub scRNA-seq 50M / cross-tissue, human, disease and non-disease (GEO, Single Cell Portal, HCA, EMBL-EBI) Value projection Other: two encoders Modified MLM with MSE loss, gene expression prediction Drug response prediction, genetic perturbation effect prediction Read depth enhancement, cell clustering
CellLM 📝Zhao et al. 2023 🔍GitHub scRNA-seq 1.8M / cross-tissue, human, disease and non-disease (PanglaoDB, CancerSCEM) Value categorization Encoder Contrastive loss, MLM with CE loss Non-disease vs cancer prediction, cell type annotation, drug response prediction None
scCLIP 📝Xiong et al. 2023 🛠️GitHub scRNA-seq, scATAC-seq 377k / cross-tissue, human fetal (ATAC, RNA) Value projection Encoder Contrastive loss, CE matching modalities None Multimodal embedding
GeneCompass 📝Yang et al. 2023 Partial 🛠 GitHub scRNA-seq 126M / cross-tissue, human and mouse, disease and non-disease (GEO, SRA, CELLxGENE, GSA, Single Cell Portal, HCA, EMBL-EBI, 3CA, Cell BLAST, TEDD, and other) ? Other: two encoders MLM with CE and MSE loss, gene ID and expression prediction Cell type annotation, drug response prediction, gene function prediction Cross-species integration, genetic perturbation effect prediction, GRN inference
CellPLM 📄(ICLR)Wen et al. 2024 Partial 🔍GitHub scRNA-seq, Spatial transcriptomics 11M / cross-tissue, human, disease and non-disease (HTCA, HCA, GEO) Cells as tokens, value projection Encoder Modified MLM with MSE loss and KL losses, gene expression prediction Gene expression imputation, cell type annotation, genetic perturbation effect prediction Cell clustering, scRNA-seq denoising
scMAE 📝Kim et al. 2023 None single-cell flow cytometry 6.5M / human, disease and non-disease (source?) Other, concatenation of values with learnable protein embeddings Other: two encoders MLM with MSE loss, protein expression prediction Cell type annotation, protein expression imputation None
CAN/CGRAN 📄(Frontiers)Wang et al. 2023 None scRNA-seq None Value projection Encoder None Cell type annotation None
scTranslator 📝Liu et al. 2023 🔍️GitHub scRNA-seq, CITE-seq None Value projection Other: two encoders None Cross-modality prediction (After cross-modality prediction training) GRN inference, cell clustering
scTransSort 📄(MDPI)Jiao et al. 2023 🛠️GitHub scRNA-seq None Value projection Encoder None Cell type annotation None
STGRNS 📄(OUP)Xu et al. 2023 🛠️GitHub scRNA-seq None Other Encoder None GRN inference None
CIForm 📄(OUP)Xu et al. 2023 🛠️GitHub scRNA-seq None Value projection Encoder None Cell type annotation None
scFormer 📝Cui et al. 2023 Incomplete ️GitHub scRNA-seq Task specific Value categorization: value binning Encoder Modified MLM with CE, cell token expression prediction, contrastive loss with cosine similarity, gene expression prediction Cell type annotation, genetic perturbation effect prediction Cell clustering
Exceiver 📝Connell et al. 2022 🛠️GitHub scRNA-seq 0.5M / cross-tissue, human (Tabula Sapiens) Other: value scaled embeddings Encoder Modified MLM with MSE, gene expression prediction Cell type annotation, drug response prediction Cell clustering
TransCluster 📄(Frontiers)Song et al. 2022 🛠️GitHub scRNA-seq None Value projection with LDA Encoder None Cell type annotation None
scBERT 📄(Nature MI)Yang et al. 2022 🔍GitHub scRNA-seq 1M / cross-tissue, human (PanglaoDB) Value categorization, binning Encoder MLM with CE loss, gene expression prediction Cell type annotation, unseen cell type detection None
iSEEEK 📄(OUP)Shen et al. 2022 🔍Github (dataset not public) scRNA-seq 11.9M / cross-tissue, cross-species (list) Ordering: rank-based Encoder MLM with CE loss Marker gene classification Cell clustering, pseudotime analysis, GRN inference
Multitask learning 📝Pang et al. 2020 None scRNA-seq 160k / brain, mouse (MBA) Value projection Other: autoencoder with two transformer encoders (?) Modified MLM with MSE loss, gene expression prediction None Cell clustering
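
The Input Embedding column above uses a few recurring categories: rank-based ordering, value binning, and value projection. As a rough illustration only, the following pure-Python sketch shows one simplified way each category could turn a cell's expression vector into model input. The function names, the equal-frequency binning rule, and the toy weights are assumptions made for this example and do not reproduce the implementation of any model listed in the table.

```python
from bisect import bisect_right


def rank_order_tokens(expression, gene_names):
    """Ordering (rank-based): expressed genes sorted from highest to lowest value."""
    expressed = [(v, g) for v, g in zip(expression, gene_names) if v > 0]
    # Descending by value; ties broken by gene name so the result is deterministic.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in expressed]


def bin_values(expression, n_bins=3):
    """Value categorization (binning): non-zero values mapped to bins 1..n_bins.

    Zeros keep the dedicated bin 0. Bin edges are equal-frequency cut points
    computed per cell from the non-zero values (an assumption of this sketch).
    """
    nonzero = sorted(v for v in expression if v > 0)
    if not nonzero:
        return [0] * len(expression)
    edges = [nonzero[(i * len(nonzero)) // n_bins] for i in range(1, n_bins)]
    return [0 if v <= 0 else 1 + bisect_right(edges, v) for v in expression]


def project_value(value, weights, bias):
    """Value projection: a scalar expression value mapped to a vector by a linear layer."""
    return [value * w + b for w, b in zip(weights, bias)]


if __name__ == "__main__":
    genes = ["A", "B", "C", "D", "E"]   # hypothetical gene names
    cell = [0.0, 5.0, 1.0, 9.0, 2.0]    # toy expression values for one cell

    print(rank_order_tokens(cell, genes))               # ['D', 'B', 'E', 'C']
    print(bin_values(cell, n_bins=2))                   # [0, 2, 1, 2, 1]
    print(project_value(2.0, [0.5, -1.0], [0.0, 1.0]))  # [1.0, -1.0]
```

In practice, the projection weights would be learned during training, and the resulting tokens or vectors would typically be combined with gene identity embeddings before entering the transformer; the manuscript describes the actual variants used by each model.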

Transformer LLMs for single-cell

Model Paper Code Omic Modalities Pre-training Dataset Input Embedding Architecture SSL Tasks Supervised Tasks Zero-shot Tasks
CELLama 📝Choi et al. 2024 🛠GitHub scRNA-seq, Spatial transcriptomics Natural Language SBERT Other: Ordering with embedding of the natural language representation, additional cell annotations are added in natural language Siamese encoders (SBERT) Contrastive loss Cell type annotation Cell type annotation, niche cell type featuring
CellWhisperer 📝Schaefer et al. 2024 Soon Bulk/scRNA-seq Transcriptome data paired with natural language annotations Geneformer- and BioBERT-based embedding models (contrastively fine-tuned) Multimodal contrastive training of embedding models (CLIP) and transcriptome instruction fine-tuning of LLM (LLaVA) None Transcriptome-aware question-answering Reference-free cell property prediction (cell types & states, disease states, organ of cell origin, ...)
scInterpreter 📝Li et al. 2024 None scRNA-seq Natural Language GPT-3.5 and Llama-13b Other: Ordering with embedding of the natural language representation Decoder, GPT-3.5 and Llama-13b NTP with CE loss and instruction finetuning (GPT-3.5 closed-source) None Cell type annotation (LLMs frozen, only small MLP trained)
ChatCell 📝❌Fang et al. 2024 🛠GitHub scRNA-seq Natural Language T5 and natural language instructions Other: Ordering with embedding as natural language with additional terms Encoder-Decoder, T5 NTP with CE loss None (conditional sequence generation, prompting) Simulation, cell type annotation, drug sensitivity prediction
MarkerGeneBERT 📝Cheng et al. 2023 None scRNA-seq Natural Language, PubMed and PubMed Central Other: Natural language preprocessed with SciBERT Encoder MLM Named Entity Recognition (NER), cell-biomarker sentence classification None
scELMo 📝Liu, Chen and Zheng 2023 Partial 🔍GitHub scRNA-seq, CITE-seq Natural Language, Closed source Other: NLP model embeddings of features weighted by the feature level in a cell (e.g. expression level) Closed source (some open) Closed source (some open) Cell type annotation, Genetic perturbation effect prediction Cell and gene embeddings in other perturbation models
GenePT 📝Chen and Zou 2023 Partial 🔍GitHub scRNA-seq Natural Language, Closed source Ordering: embedding as natural language Closed source Closed source Gene function prediction Cell clustering, GRN inference
GPT-4 📄(Nature Meth)W. Hou and Z. Ji 2024 🔍GitHub scRNA-seq Natural Language, Closed source Ordering: embedding as natural language Closed source Closed source None (conditional sequence generation, prompting) Cell type annotation
Cell2Sentence 📄(ICML)Levine et al. 2024 🛠️GitHub scRNA-seq Natural Language (GPT2) and scRNA-seq (40k / immune, human) Ordering: embedding as natural language Decoder NTP with CE loss None Simulation, cell type annotation

Single-cell transformer evaluation

| Paper | Code | Omic Modalities | Evaluated Transformers | Tasks | Notes |
|---|---|---|---|---|---|
| 📝He et al. 2024 | 🛠️GitHub | scRNA-seq | scGPT | Cell type annotation | Evaluation of Parameter-Efficient Fine-Tuning (PEFT) for scGPT. Indicates that PEFT not only is more compute-efficient, but also results in better cell type prediction. |
| 📄(Nature MI)Khan et al. 2023 | 🛠️GitHub | scRNA-seq | scBERT | Cell type annotation, unseen cell type detection | Focused on imbalanced cell type classification. scBERT is sensitive to class imbalance. scBERT outperforms Seurat. scBERT doesn't perform well in unseen cell type detection. It benefits from SSL pretraining. |
| 📝Liu et al. 2023 | 🛠️GitHub | scRNA-seq, scATAC-seq, Spatial transcriptomics | scGPT, Geneformer, scBERT, tGPT, CellLM | Cell clustering, cell type annotation, multimodal embedding, GRN inference, gene expression imputation, genetic perturbation effect prediction, simulation, gene function prediction | Models aren't trained on the same datasets. scGPT is positioned as most versatile in terms of task diversity that it can tackle. Models other than transformer appear to be at least as good as transformers in most tasks. Transformers were shown to be sensitive to the choice of hyperparameters, such as learning rate and epochs. |
| 📝Boiarsky et al. 2023 | 🛠️GitHub | scRNA-seq | scBERT, scGPT | Cell type annotation | Logistic regression appears to be as good as transformers in cell type annotation, even in low-data scenarios. |
| 📝Kedzierska et al. 2023 | 🛠️GitHub | scRNA-seq | scGPT, Geneformer | Cell clustering | Zero-shot performance only. Both models appear unreliable. |
| 📝Alsabbagh et al. 2023 | 🛠️GitHub | scRNA-seq | scGPT, Geneformer, scBERT | Cell type annotation | Focused on imbalanced cell type classification. Geneformer appears to be outperformed by scGPT and scBERT, where the two latter perform similarly. |

Legend

  • 📝 - Preprint
  • 📄 - Peer-Reviewed Publication
  • 🛠️ - Fully reproducible
  • 🔍 - Code for evaluation only
  • ❌ - Retracted or withdrawn

Citing this work

If you find the data in this repository useful for your work, please cite:

@Article{szalata_transformers_2024,
	title = {Transformers in single-cell omics: a review and new perspectives},
	volume = {21},
	issn = {1548-7105},
	url = {https://doi.org/10.1038/s41592-024-02353-z},
	doi = {10.1038/s41592-024-02353-z},
	abstract = {Recent efforts to construct reference maps of cellular phenotypes have expanded the volume and diversity of single-cell omics data, providing an unprecedented resource for studying cell properties. Despite the availability of rich datasets and their continued growth, current single-cell models are unable to fully capitalize on the information they contain. Transformers have become the architecture of choice for foundation models in other domains owing to their ability to generalize to heterogeneous, large-scale datasets. Thus, the question arises of whether transformers could set off a similar shift in the field of single-cell modeling. Here we first describe the transformer architecture and its single-cell adaptations and then present a comprehensive review of the existing applications of transformers in single-cell analysis and critically discuss their future potential for single-cell biology. By studying limitations and technical challenges, we aim to provide a structured outlook for future research directions at the intersection of machine learning and single-cell biology.},
	pages = {1430--1443},
	number = {8},
	journaltitle = {Nature Methods},
	shortjournal = {Nature Methods},
	author = {Szałata, Artur and Hrovatin, Karin and Becker, Sören and Tejada-Lapuerta, Alejandro and Cui, Haotian and Wang, Bo and Theis, Fabian J.},
	date = {2024-08-01},}