🔥🔥🔥 A survey on data-centric foundation models in computational healthcare
Last updated: 2024/07/16
📝 If you find this repo helpful, please cite our survey. Thanks!
@article{zhang2024data,
title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
journal={arXiv preprint arXiv:2401.02458},
year={2024}
}
In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.
📖 Contents
A star (*) in the pre-training data column indicates that the authors constructed the data from more than three sources.
Model | Subfield | Paper | Code | Base | Pre-Training Data |
---|---|---|---|---|---|
nach0 | Molecules | nach0: Multimodal Natural and Chemical Languages Foundation Model | Github | T5 | * |
MoleculeSTM | Drug | Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing | Github | CLIP | PubChem |
AlphaMissense | Proteomics | Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense | Github | AlphaFold | PDB + UniRef |
GET | Genomics | GET: A Foundation Model of Transcription across Human Cell Types | Huggingface | Transformer | * |
GIT-Mol | Molecules | GIT-Mol: A Multi-Modal Large Language Model for Molecular Science with Graph, Image, and Text | Github | T5 + BLIP-2 | PubChem |
ESM-2 | Proteomics | Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model | Github | Transformer | UniRef |
AlphaFold 2 | Proteomics | Highly Accurate Protein Structure Prediction with AlphaFold | Github | - | PDB + Uniclust30 |
Model | Subfield | Paper | Code | Base | Pre-Training Data |
---|---|---|---|---|---|
OmniNA | Nucleotide sequence | OmniNA: A Foundation Model for Nucleotide Sequences | - | LLaMA | NCBI |
LaBraM | EEG | Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI | - | Transformer | * |
Neuro-GPT | EEG | Neuro-GPT: Developing A Foundation Model for EEG | - | - | TUH EEG |
Dataset (Paper) | Description | Link |
---|---|---|
MMedBench (arXiv) | A multilingual medical QA benchmark, where questions are categorized into 21 topics | Github |
MMedC (arXiv) | A multilingual medical corpus containing over 25.5B tokens | Github |
BiMed1.3M (arXiv) | An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat | Github |
GAP-Replay (arXiv) | 48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay | Github |
Huatuo-26M (arXiv) | 26M Chinese medical QA pairs | Github |
Medical Meadow (arXiv) | 16M medical QA pairs collected from 9 sources | Github |
MultiMedQA (Nature) | 6 existing and 1 online-collected medical QA datasets | Nature |
BigBio (Nature) | 126+ biomedical NLP datasets covering 13 task categories and 10+ languages | Github |
MedMCQA (MLR) | 194K multiple-choice questions covering 2.4K healthcare topics | Official site |
MedQA-USMLE (MDPI) | 61,097 multiple-choice questions based on USMLE in three languages | Github |
CBLUE (arXiv) | A Chinese biomedical language understanding evaluation benchmark with 18 datasets | Official site |
BLURB (arXiv) | 13 biomedical NLP datasets in 6 tasks | Official site |
PubMedQA (arXiv) | 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances | Official site |
BLUE (arXiv) | 5 language tasks with 10 biomedical and clinical text datasets | Github |
webMedQA (BMC) | 63,284 real-world Chinese medical questions with over 300K answers | Github |
MedMentions (arXiv) | 4,392 papers annotated by experts with mentions of UMLS entities | Github |
MIMIC-III (Nature) | Critical care data for over 40,000 patients | Official site |
ClinicalTrials.gov | An online database of clinical research studies, including clinical trials and observational studies | Official site |
Dataset (Paper) | Description | Link |
---|---|---|
Mass-100K (arXiv) | 100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types | - |
RETFound (Nature) | Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans | Nature |
AbdomenAtlas-8K (arXiv) | 8,448 CT volumes with per-voxel annotated eight abdominal organs | Github |
Med-MNIST v2 (Nature) | 12 2D and 6 3D datasets for biomedical image classification | Official site |
EchoNet-Dynamic (Nature) | 10,030 expert-annotated echocardiogram videos | Official site |
CheXpert (arXiv) | 224,316 chest radiographs of 65,240 patients | Official site |
Kather Colon Dataset (PMC) | 100K histological images of human colorectal cancer and healthy tissue | Zenodo |
DeepLesion (PMC) | 32K CT scans with annotations and semantic labels from radiological reports | NIH |
ChestXray-NIHCC (arXiv) | 100K radiographs with labels from more than 30,000 patients | NIH |
ISIC | An archive containing 23K skin lesion images with labels | Official site |
Dataset (Paper) | Description | Link |
---|---|---|
1000 Genomes Project (Nature) | A comprehensive catalog of human genetic variations | Official site |
ENCODE (Nature) | A platform of genomics data and encyclopedia with integrative-level and ground-level annotations | NIH |
dbSNP (NIH) | A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions | NIH |
Dataset (Paper) | Description | Link |
---|---|---|
DrugChat (arXiv) | 143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL | Github |
PubChem (NIH) | A collection of 900+ sources of chemical information data | NIH |
DrugBank (NIH) | A web-enabled structured database of molecular information about drugs | Official site |
ChEMBL (NIH) | 20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets | Official site |
Dataset (Paper) | Description | Link |
---|---|---|
RadGenome-Chest CT (arXiv) | A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs | - |
OmniMedVQA (arXiv) | 131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets | - |
SAT-DS (arXiv) | 11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS | Github |
PathChatInstruct (arXiv) | 257,004 instructions of pathology-specific queries with image and text | - |
Chi-Med-VL (arXiv) | 580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese | Github |
MedMD (arXiv) | 15.5M 2D scans and 180K 3D radiology scans with textual descriptions | Github |
OpenPath (Nature) | 208,414 pathology images paired with natural language descriptions | Huggingface |
Quilt-1M (arXiv) | 1M image-text pairs for histopathology | Github |
Med-MMHL (arXiv) | Human- and LLM-generated misinformation detection dataset | Github |
Mol-Instructions (arXiv) | 148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions | Huggingface |
PathInstruct (arXiv) | 180K samples of LLM-generated instruction-following data | Github |
PMC-VQA (arXiv) | 227K VQA pairs of 149K images of various modalities or diseases | Github |
PMC-OA (arXiv) | 1.6M fine-grained biomedical image-text pairs | Github |
PathCap (arXiv) | 142K pathology image-caption pairs from various sources | Github |
SwissProtCLAP (arXiv) | 441K text-protein sequence pairs | Github |
MIMIC-IV (Nature) | Clinical information for hospital stays of over 60,000 patients | Official site |
MIMIC-CXR (Nature) | 227,835 chest imaging studies with free-text reports for 65,379 patients | PhysioNet |
TCGA | A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types | Official site |