Data-Centric Foundation Models in Computational Healthcare

🔥🔥🔥 A survey on data-centric foundation models in computational healthcare

Project Page | Paper [arXiv]

Last updated: 2024/07/16

📝 If you find this repo helpful, please cite our survey. Thanks!

@article{zhang2024data,
  title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
  author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
  journal={arXiv preprint arXiv:2401.02458},
  year={2024}
}

In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.

Healthcare and Medical Foundation Models

A star (*) after the pre-training data indicates that the authors constructed the dataset from more than three sources.

Language Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| MMedLM 2 | Medicine | Towards Building Multilingual Language Model for Medicine | Github | InternLM 2 | MMedC* |
| BiMediX | Medicine | BiMediX: Bilingual Medical Mixture of Experts LLM | Github | Mixtral | BiMed1.3M* |
| Me LLaMA | Medicine | Me LLaMA: Foundation Large Language Models for Medical Applications | Github | LLaMA 2 | * |
| BioMistral | Biomedicine | BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains | - | Mistral | PMC |
| PULSE | Medicine | - | Github | InternLM | * |
| Meditron | Medicine | Meditron-70B: Scaling Medical Pretraining for Large Language Models | Github | LLaMA 2 | GAP-Replay* |
| Taiyi | Biomedicine | Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks | Github | Qwen | BigBio + CBLUE |
| BioMedGPT | Biomedicine | BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine | Github | LLaMA 2 | S2ORC |
| Clinical LLaMA-LoRA | Clinic | Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain | - | LLaMA | MIMIC-IV |
| Med-PaLM 2 | Clinic | Towards Expert-Level Medical Question Answering with Large Language Models | Google | PaLM 2 | MultiMedQA |
| PMC-LLaMA | Medicine | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | Github | LLaMA | MedC |
| MedAlpaca | Medicine | MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data | Github | LLaMA | Medical Meadow |
| BenTsao (HuaTuo) | Biomedicine | HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge | Github | LLaMA | CMeKG |
| ChatDoctor | Medicine | ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge | Github | LLaMA | HealthCareMagic* |
| Clinical-T5 | Clinic | Clinical-T5: Large Language Models Built Using MIMIC Clinical Text | PhysioNet | T5 | MIMIC-III + MIMIC-IV |
| Med-PaLM | Clinic | Large Language Models Encode Clinical Knowledge | Google | PaLM | MultiMedQA |
| BioGPT | Biomedicine | BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining | Github | GPT-2 | PubMed |
| BioLinkBERT | Biomedicine | LinkBERT: Pretraining Language Models with Document Links | Github | BERT | PubMed |
| PubMedBERT | Biomedicine | Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing | Microsoft | BERT | PubMed |
| BioBERT | Biomedicine | BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining | Github | BERT | PubMed + PMC |
| BlueBERT | Biomedicine | An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining | Github | BERT | PubMed + MIMIC-III |
| Clinical BERT | Clinic | Publicly Available Clinical BERT Embeddings | Github | BERT | MIMIC-III |
| SciBERT | Biomedicine | SciBERT: A Pretrained Language Model for Scientific Text | Github | BERT | Semantic Scholar |

Vision Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| Prov-GigaPath | Pathology | A Whole-Slide Foundation Model for Digital Pathology from Real-World Data | Github | - | Prov-Path* |
| BEPH | Pathology | A Foundation Model for Generalizable Cancer Diagnosis and Survival Prediction from Histopathological Images | Github | BEiTv2 | * |
| (No name) | Radiology | Foundation Model for Cancer Imaging Biomarkers | Github | SimCLR | * |
| VISION-MAE | Radiology | VISION-MAE: A Foundation Model for Medical Image Segmentation and Classification | - | MAE | * |
| RudolfV | Pathology | RudolfV: A Foundation Model by Pathologists for Pathologists | - | DINOv2 | * |
| PathoDuet | Pathology | PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains | Github | MoCo v3 | TCGA + HyReCo + BCI |
| UNI | Pathology | A General-Purpose Self-Supervised Model for Computational Pathology | - | DINOv2 | Mass-100K |
| REMEDIS | Radiology | Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging | Github | SimCLR | MIMIC-IV + CheXpert |
| Virchow | Pathology | Virchow: A Million-Slide Digital Pathology Foundation Model | - | DINOv2 | * |
| RETFound | Retinopathy | A Foundation Model for Generalizable Disease Detection from Retinal Images | Github | MAE | * |
| CTransPath | Pathology | Transformer-Based Unsupervised Contrastive Learning for Histopathological Image Classification | Github | - | TCGA + PAIP |
| HIPT | Pathology | Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning | Github | DINO | TCGA |

Vision-Language Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| PRISM | Pathology | PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology | - | CoCa | * |
| EchoCLIP | Cardiology | Vision-Language Foundation Model for Echocardiogram Interpretation | Github | CLIP | * |
| ChemDFM | Chemistry | ChemDFM: Dialogue Foundation Model for Chemistry | - | LLaMA | PubMed + USPTO |
| CheXagent | Radiology | CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation | Github | BLIP-2 | CheXinstruct* |
| SAT | Radiology | One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Github | - | SAT-DS* |
| PathChat | Pathology | A Foundational Multimodal Vision Language AI Assistant for Human Pathology | - | LLaVA | PathChatInstruct* |
| Qilin-Med-VL | Radiology | Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare | Github | LLaVA | Chi-Med-VL* |
| CXR-CLIP | Radiology | CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training | Github | CLIP | MIMIC-CXR + CheXpert + ChestX-ray14 |
| MaCo | Radiology | Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning | - | MAE + CLIP | MIMIC-CXR |
| PathLDM | Pathology | PathLDM: Text conditioned Latent Diffusion Model for Histopathology | Github | Latent Diffusion | TCGA-BRCA + GPT-3.5 |
| RadFM | Radiology | Towards Generalist Foundation Model for Radiology | Github | - | MedMD* |
| KAD | Radiology | Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images | Github | CLIP | MIMIC-CXR + UMLS |
| Med-Flamingo | Medicine | Med-Flamingo: A Multimodal Medical Few-Shot Learner | Github | Flamingo | MTB + PMC-OA |
| CONCH | Pathology | A Visual-Language Foundation Model for Computational Pathology | Github | CoCa | PubMed + PMC |
| QuiltNet | Pathology | Quilt-1M: One Million Image-Text Pairs for Histopathology | Github | CLIP | Quilt-1M* |
| PathAsst | Pathology | PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology | Github | CLIP | PathCap + PathInstruct* |
| PLIP | Pathology | A Visual-Language Foundation Model for Pathology Image Analysis Using Medical Twitter | Huggingface | CLIP | OpenPath* |
| MI-Zero | Pathology | Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images | Github | CLIP | ARCH |
| LLaVA-Med | Biomedicine | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Github | LLaVA | PMC-15M + GPT-4 |
| MedVInT | Biomedicine | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Github | - | PMC-VQA* |
| PMC-CLIP | Biomedicine | PMC-CLIP: Contrastive Language-Image Pre-Training Using Biomedical Documents | Github | CLIP | PMC-OA* |
| BiomedCLIP | Biomedicine | Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing | Huggingface | CLIP | PMC-15M* |
| MedKLIP | Radiology | MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training | Github | CLIP | MIMIC-CXR |
| MedCLIP | Medicine | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Github | CLIP | CheXpert + MIMIC-CXR |
| CheXzero | Radiology | Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning | Github | CLIP | MIMIC-CXR |
| PubMedCLIP | Radiology | Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? | Github | CLIP | ROCO |

Protein and Molecule Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| nach0 | Molecules | nach0: Multimodal Natural and Chemical Languages Foundation Model | Github | T5 | * |
| MoleculeSTM | Drug | Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing | Github | CLIP | PubChem |
| AlphaMissense | Proteomics | Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense | Github | AlphaFold | PDB + UniRef |
| GET | Genomics | GET: A Foundation Model of Transcription across Human Cell Types | Huggingface | Transformer | * |
| GIT-Mol | Molecules | GIT-Mol: A Multi-Modal Large Language Model for Molecular Science with Graph, Image, and Text | Github | T5 + BLIP-2 | PubChem |
| ESM-2 | Proteomics | Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model | Github | Transformer | UniRef |
| AlphaFold 2 | Proteomics | Highly Accurate Protein Structure Prediction with AlphaFold | Github | - | PDB + Uniclust30 |

Other Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| OmniNA | Nucleotide sequence | OmniNA: A Foundation Model for Nucleotide Sequences | - | LLaMA | NCBI |
| LaBraM | EEG | Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI | - | Transformer | * |
| Neuro-GPT | EEG | Neuro-GPT: Developing A Foundation Model for EEG | - | - | TUH EEG |

Datasets for Foundation Models

Text

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| MMedBench (arXiv) | A multilingual medical QA benchmark, where questions are categorized into 21 topics | Github |
| MMedC (arXiv) | A multilingual medical corpus containing over 25.5B tokens | Github |
| BiMed1.3M (arXiv) | An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat | Github |
| GAP-Replay (arXiv) | 48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay | Github |
| Huatuo-26M (arXiv) | 26M Chinese medical QA pairs | Github |
| Medical Meadow (arXiv) | 16M medical QA pairs collected from 9 sources | Github |
| MultiMedQA (Nature) | 6 existing and 1 online-collected medical QA datasets | Nature |
| BigBio (Nature) | 126+ biomedical NLP datasets covering 13 task categories and 10+ languages | Github |
| MedMCQA (MLR) | 194K multiple-choice questions covering 2.4K healthcare topics | Official site |
| MedQA-USMLE (MDPI) | 61,097 multiple-choice questions based on USMLE in three languages | Github |
| CBLUE (arXiv) | A Chinese biomedical language understanding evaluation benchmark with 18 datasets | Official site |
| BLURB (arXiv) | 13 biomedical NLP datasets in 6 tasks | Official site |
| PubMedQA (arXiv) | 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances | Official site |
| BLUE (arXiv) | 5 language tasks with 10 biomedical and clinical text datasets | Github |
| webMedQA (BMC) | 63,284 real-world Chinese medical questions with over 300K answers | Github |
| MedMentions (arXiv) | 4,392 papers annotated by experts with mentions of UMLS entities | Github |
| MIMIC-III (Nature) | Critical care data for over 40,000 patients | Official site |
| ClinicalTrials.gov | An online database of clinical research studies, including clinical trials and observational studies | Official site |

Imaging

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| Mass-100K (arXiv) | 100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types | - |
| RETFound (Nature) | Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans | Nature |
| AbdomenAtlas-8K (arXiv) | 8,448 CT volumes with per-voxel annotated eight abdominal organs | Github |
| Med-MNIST v2 (Nature) | 12 2D and 6 3D datasets for biomedical image classification | Official site |
| EchoNet-Dynamic (Nature) | 10,030 expert-annotated echocardiogram videos | Official site |
| CheXpert (arXiv) | 224,316 chest radiographs of 65,240 patients | Official site |
| Kather Colon Dataset (PMC) | 100K histological images of human colorectal cancer and healthy tissue | Zenodo |
| DeepLesion (PMC) | 32K CT scans with annotations and semantic labels from radiological reports | NIH |
| ChestXray-NIHCC (arXiv) | 100K radiographs with labels from more than 30,000 patients | NIH |
| ISIC | An archive containing 23K skin lesion images with labels | Official site |

Genomics

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| 1000 Genomes Project (Nature) | A comprehensive catalog of human genetic variations | Official site |
| ENCODE (Nature) | A platform of genomics data and encyclopedia with integrative-level and ground-level annotations | NIH |
| dbSNP (NIH) | A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions | NIH |

Drug

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| DrugChat (arXiv) | 143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL | Github |
| PubChem (NIH) | A collection of 900+ sources of chemical information data | NIH |
| DrugBank (NIH) | A web-enabled structured database of molecular information about drugs | Official site |
| ChEMBL (NIH) | 20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets | Official site |

Multi-Modal

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| RadGenome-Chest CT (arXiv) | A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs | - |
| OmniMedVQA (arXiv) | 131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets | - |
| SAT-DS (arXiv) | 11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS | Github |
| PathChatInstruct (arXiv) | 257,004 instructions of pathology-specific queries with image and text | - |
| Chi-Med-VL (arXiv) | 580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese | Github |
| MedMD (arXiv) | 15.5M 2D scans and 180K 3D radiology scans with textual descriptions | Github |
| OpenPath (Nature) | 208,414 pathology images paired with natural language descriptions | Huggingface |
| Quilt-1M (arXiv) | 1M image-text pairs for histopathology | Github |
| Med-MMHL (arXiv) | Human- and LLM-generated misinformation detection dataset | Github |
| Mol-Instructions (arXiv) | 148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions | Huggingface |
| PathInstruct (arXiv) | 180K samples of LLM-generated instruction-following data | Github |
| PMC-VQA (arXiv) | 227K VQA pairs of 149K images of various modalities or diseases | Github |
| PMC-OA (arXiv) | 1.6M fine-grained biomedical image-text pairs | Github |
| PathCap (arXiv) | 142K pathology image-caption pairs from various sources | Github |
| SwissProtCLAP (arXiv) | 441K text-protein sequence pairs | Github |
| MIMIC-IV (Nature) | Clinical information for hospital stays of over 60,000 patients | Official site |
| MIMIC-CXR (Nature) | 227,835 chest imaging studies with free-text reports for 65,379 patients | PhysioNet |
| TCGA | A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types | Official site |