Transformers in Single-Cell Omics

Note 🚧 This repository is under construction. This note will disappear as soon as all the single-cell transformer paper tables are added.

This repository accompanies Transformers in Single-Cell Omics: A Review and New Perspectives. Please refer to the manuscript for the details.

We provide a curated list of single-cell transformers and their evaluation results. We skip models that work only on bulk data or slide images, and those where transformers are used only as a part of the model. Models focusing on sequential data, such as DNA or protein sequences, are omitted too. New entries are added at the top of the corresponding table.

We welcome contributions to this repository. Please open a pull request or an issue if you want to add or edit an entry.

Single-cell transformers

Model	Paper	Code	Omic Modalities	Pre-training Dataset	Input Embedding	Architecture	SSL Tasks	Supervised Tasks	Zero-shot Tasks
scMulan	📝Bian et al. 2024	🔍GitHub	scRNA-Seq	10M / cross-tissue, human (hECA)	Not specified	Decoder	Conditional cell generation	Cell type annotation, cell metadata annotation (both also used in training)	Batch integration
BioFormers	📝Belgadi and Li et al. 2023	None	scRNA-Seq	8K / single tissue, human (PBMC, Adamson et al. 2016)	Value categorization: value binning	Encoder	MLM with CE loss	None	Cell clustering, gene expression imputation, genetic perturbation effect prediction, GRN inference
Geneformer	📄(Nature)Theodoris et al. 2023	🛠🤗	scRNA-Seq	36M / cross-tissue, human (Genecorpus)	Ordering: rank-based	Encoder	MLM with CE loss, gene ID prediction	Gene function prediction, cell annotation	Cell clustering, GRN inference
Universal Cell Embedding	📝Rosen et al. 2023	🔍GitHub	scRNA-Seq	36M / cross-tissue, cross-species (CELLxGENE and other)	Other: ESM-2-based gene embeddings. Gene embeddings are sampled according to expression levels and ordered by chromosomal position.	Encoder	Modified MLM with binary CE loss predicting whether a gene is expressed. Uses the CLS embedding instead of token embeddings.	Cell annotation	Cell clustering, cross-species integration
scGPT	📄(Nature Meth)Cui et al. 2024	🔍GitHub	scRNA-Seq, scATAC-Seq, CITE-Seq, Spatial transcriptomics	33M / cross-tissue, human, non-disease (CELLxGENE)	Value categorization: value binning	Other: attention masking in encoder	Iterative MLM variant with MSE loss, cell token expression prediction, gene expression prediction	Cell type annotation, genetic perturbation effect prediction, reverse perturbation prediction, cell clustering, multimodal embedding, gene function prediction	Cell clustering, GRN inference, simulation, gene expression imputation
TOSICA	📄(Nature Comms)Chen et al. 2023	🛠️GitHub	scRNA-Seq	None	Value projection	Encoder	None	Cell type annotation	None
scMoFormer	📄(ACM)Tang et al. 2023	🛠️GitHub	scRNA-Seq, scATAC-Seq, CITE-Seq	None	Other, SVD-based	Encoder and graph transformers	None	Cross-modality prediction	None
tGPT	📄(Cell iScience)Shen et al. 2023	🛠️GitHub	scRNA-Seq	22M / cross-tissue, cross-species, disease and non-disease, organoids (list)	Ordering	Decoder	NTP with CE loss, gene ID prediction	None	Cell clustering, trajectory inference
SpaFormer	📝Wen et al. 2023	🛠️GitHub	Spatial transcriptomics	None	Cells as tokens, value projection	Encoder	Modified MLM with MSE loss, gene expression prediction	Gene expression imputation	Cell clustering
scFoundation	📝Hao et al. 2023 and Gong et al. 2023	🔍GitHub	scRNA-Seq	50M / cross-tissue, human, disease and non-disease (GEO, Single Cell Portal, HCA, EMBL-EBI)	Value projection	Other: two encoders	Modified MLM with MSE loss, gene expression prediction	Drug response prediction, genetic perturbation effect prediction	Read depth enhancement, cell clustering
CellLM	📝Zhao et al. 2023	🔍GitHub	scRNA-Seq	1.8M / cross-tissue, human, disease and non-disease (PanglaoDB, CancerSCEM)	Value categorization	Encoder	Contrastive loss, MLM with CE loss	Non-disease vs cancer prediction, cell type annotation, drug response prediction	None
scCLIP	📝Xiong et al. 2023	🛠️GitHub	scRNA-Seq, scATAC-Seq	377k / cross-tissue, human fetal (ATAC, RNA)	Value projection	Encoder	Contrastive loss, CE matching modalities	None	Multimodal embedding
GeneCompass	📝Yang et al. 2023	GitHub, no code yet	scRNA-Seq	126M / cross-tissue, human and mouse, disease and non-disease (GEO, SRA, CELLxGENE, GSA, Single Cell Portal, HCA, EMBL-EBI, 3CA, Cell BLAST, TEDD, and other)	?	Other: two encoders	MLM with CE and MSE loss, gene ID and expression prediction	Cell type annotation, drug response prediction, gene function prediction	Cross-species integration, genetic perturbation effect prediction, GRN inference
CellPLM	📄(ICLR)Wen et al. 2023	Partial 🔍GitHub	scRNA-Seq, Spatial transcriptomics	11M / cross-tissue, human, disease and non-disease (HTCA, HCA, GEO)	Cells as tokens, value projection	Encoder	Modified MLM with MSE loss and KL losses, gene expression prediction	Gene expression imputation, cell type annotation, genetic perturbation effect prediction	Cell clustering, scRNA-Seq denoising
scMAE	📝Kim et al. 2023	None	single-cell flow cytometry	6.5M / human, disease and non-disease (source?)	Other, concatenation of values with learnable protein embeddings	Other: two encoders	MLM with MSE loss, protein expression prediction	Cell type annotation, protein expression imputation	None
CAN/CGRAN	📝Wang et al. 2023	None	scRNA-Seq	None	Value projection	Encoder	None	Cell type annotation	None
scTranslator	📝Liu et al. 2023	🔍️GitHub	scRNA-Seq, CITE-Seq	None	Value projection	Other: two encoders	None	Cross-modality prediction	(After cross-modality prediction training) GRN inference, cell clustering
scTransSort	📄(MDPI)Jiao et al. 2023	🛠️GitHub	scRNA-Seq	None	Value projection	Encoder	None	Cell type annotation	None
STGRNS	📄(OUP)Xu et al. 2023	🛠️GitHub	scRNA-Seq	None	Other	Encoder	None	GRN inference	None
CIForm	📄(OUP)Xu et al. 2023	🛠️GitHub	scRNA-Seq	None	Value projection	Encoder	None	Cell type annotation	None
scFormer	📝Cui et al. 2023	Incomplete ️GitHub	scRNA-Seq	Task specific	Value categorization: value binning	Encoder	Modified MLM with CE, cell token expression prediction, contrastive loss with cosine similarity, gene expression prediction	Cell type annotation, genetic perturbation effect prediction	Cell clustering
Exceiver	📝Connell et al. 2022	🛠️GitHub	scRNA-Seq	0.5M / cross-tissue, human (Tabula Sapiens)	Other: value scaled embeddings	Encoder	Modified MLM with MSE, gene expression prediction	Cell type annotation, drug response prediction	Cell clustering
TransCluster	📄(Frontiers)Song et al. 2022	🛠️GitHub	scRNA-Seq	None	Value projection with LDA	Encoder	None	Cell type annotation	None
scBERT	📄(Nature MI)Yang et al. 2022	🔍GitHub	scRNA-Seq	1M / cross-tissue, human (PanglaoDB)	Value categorization, binning	Encoder	MLM with CE loss, gene expression prediction	Cell type annotation, unseen cell type detection	None
iSEEEK	📄(OUP)Shen et al. 2022	🔍GitHub (dataset not public)	scRNA-Seq	11.9M / cross-tissue, cross-species (list)	Ordering: rank-based	Encoder	MLM with CE loss	Marker gene classification	Cell clustering, pseudotime analysis, GRN inference
Multitask learning	📝Pang et al. 2020	None	scRNA-Seq	160k / brain, mouse (MBA)	Value projection	Other: autoencoder with two transformer encoders (?)	Modified MLM with MSE loss, gene expression prediction	None	Cell clustering
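The Input Embedding column above distinguishes, among others, rank-based ordering and value binning. The sketch below is a generic illustration of these two ideas under simplifying assumptions; it is not the preprocessing code of any listed model. The function names, the gene names, and the per-cell equal-frequency binning rule are hypothetical choices made for this example.

```python
from bisect import bisect_left


def rank_order_tokens(expression, top_k=None):
    """Ordering (rank-based): genes sorted by descending expression.

    Genes with zero expression are dropped; ties are broken by gene name
    so the output is deterministic. This is an illustrative rule only.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    genes = [gene for gene, _ in expressed]
    return genes if top_k is None else genes[:top_k]


def bin_values(expression, n_bins=3):
    """Value categorization (binning): map each value to a discrete bin.

    Zero expression gets bin 0. Non-zero values get a bin in 1..n_bins
    based on their rank among the cell's non-zero values (an assumed
    equal-frequency scheme, computed per cell).
    """
    nonzero = sorted(value for value in expression.values() if value > 0)
    bins = {}
    for gene, value in expression.items():
        if value <= 0:
            bins[gene] = 0
        else:
            # Number of non-zero values strictly smaller than this one.
            rank = bisect_left(nonzero, value)
            bins[gene] = 1 + (rank * n_bins) // len(nonzero)
    return bins


# Hypothetical toy cell: gene name -> normalized expression value.
cell = {"GENE_A": 0.0, "GENE_B": 5.0, "GENE_C": 1.0, "GENE_D": 2.0}

ordered = rank_order_tokens(cell)
binned = bin_values(cell, n_bins=3)
print(ordered)
print(binned)
```

In the first case the token sequence itself carries the expression information through gene order; in the second, each gene keeps its identity and is paired with a categorical value token. Models listed with value projection instead feed continuous values through a learned projection, which is not shown here.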

Transformer LLMs for single-cell

Model	Paper	Code	Omic Modalities	Pre-training Dataset	Input Embedding	Architecture	SSL Tasks	Supervised Tasks	Zero-shot Tasks
scInterpreter	📝Li et al. 2024	None	scRNA-Seq	Natural Language GPT-3.5 and Llama-13b	Other: Ordering with embedding of the natural language representation	Decoder, GPT-3.5 and Llama-13b	NTP with CE loss and instruction finetuning (GPT-3.5 closed-source)	None	Cell type annotation (LLMs frozen, only small MLP trained)
ChatCell	📝❌Fang et al. 2024	🛠GitHub	scRNA-Seq	Natural Language T5 and natural language instructions	Other: Ordering with embedding as natural language with additional terms	Encoder-Decoder, T5	NTP with CE loss	None (conditional sequence generation, prompting)	Simulation, cell type annotation, drug sensitivity prediction
MarkerGeneBERT	📝Cheng et al. 2023	None	scRNA-Seq	Natural Language, PubMed and PubMed Central	Other: Natural language preprocessed with SciBERT	Encoder	MLM	Named Entity Recognition (NER), cell-biomarker sentence classification	None
scELMo	📝Liu, Chen and Zheng 2023	Partial 🔍GitHub	scRNA-Seq, CITE-Seq	Natural Language, Closed source	Other: NLP model embeddings of features weighted by the feature level in a cell (e.g. expression level)	Closed source (some open)	Closed source (some open)	Cell type annotation, Genetic perturbation effect prediction	Cell and gene embeddings in other perturbation models
GenePT	📝Chen and Zou 2023	Partial 🔍GitHub	scRNA-Seq	Natural Language, Closed source	Ordering: embedding as natural language	Closed source	Closed source	Gene function prediction	Cell clustering, GRN inference
GPT-4	📝Z. Ji and Hou 2023	None	scRNA-Seq	Natural Language, Closed source	Ordering: embedding as natural language	Closed source	Closed source	None (conditional sequence generation, prompting)	Cell type annotation
Cell2Sentence	📝Levine et al. 2023	🛠️GitHub	scRNA-Seq	Natural Language (GPT2) and scRNA-Seq (40k / immune, human)	Ordering: embedding as natural language	Decoder	NTP with CE loss	None	Simulation, cell type annotation

Single-cell transformer evaluation

Paper	Code	Omic Modalities	Evaluated Transformers	Tasks	Notes
📝He et al. 2024	🛠️GitHub	scRNA-Seq	scGPT	Cell type annotation	Evaluation of Parameter-Efficient Fine-Tuning (PEFT) for scGPT. Indicates that PEFT is not only more compute-efficient but also results in better cell type prediction.
📄(Nature MI)Khan et al. 2023	🛠️GitHub	scRNA-Seq	scBERT	Cell type annotation. Unseen cell type detection	Focused on imbalanced cell type classification. scBERT is sensitive to class imbalance. scBERT outperforms Seurat. scBERT doesn't perform well in unseen cell type detection. It benefits from SSL pretraining.
📝Liu et al. 2023	🛠️GitHub	scRNA-Seq, scATAC-Seq, Spatial transcriptomics	scGPT, Geneformer, scBERT, tGPT, CellLM	Cell clustering, cell type annotation, multimodal embedding, GRN inference, gene expression imputation, genetic perturbation effect prediction, simulation, gene function prediction	Models aren't trained on the same datasets. scGPT is positioned as the most versatile model in terms of the range of tasks it can tackle. Non-transformer models appear to be at least as good as transformers in most tasks. Transformers were shown to be sensitive to the choice of hyperparameters, such as learning rate and number of epochs.
📝Boiarsky et al. 2023	🛠️GitHub	scRNA-Seq	scBERT, scGPT	Cell type annotation	Logistic regression appears to be as good as transformers in cell type annotation, even in low-data scenarios.
📝Kedzierska et al. 2023	🛠️GitHub	scRNA-Seq	scGPT, Geneformer	Cell clustering	Zero-shot performance only. Both models appear unreliable.
📝Alsabbagh et al. 2023	🛠️GitHub	scRNA-Seq	scGPT, Geneformer, scBERT	Cell type annotation	Focused on imbalanced cell type classification. Geneformer appears to be outperformed by scGPT and scBERT, which perform similarly to each other.

Legend

📝 - Preprint
📄 - Peer-Reviewed Publication
🛠️ - Fully reproducible
🔍 - Code for evaluation only
❌ - Retracted or withdrawn

Citing this work

If you find the data in this repository useful for your work, please cite:

@Article{TBA}

marioernestovaldes/single-cell-transformer-papers