Vision-and-Language-in-Medicine

Official code of the MICCAI'23 paper "Text-guided Foundation Model Adaptation for Pathological Image Classification"

[Code] CITE: Connecting Image and Text Embeddings


Updated on 2023.12.26

Key Features

This repository provides the official implementation of Text-guided Foundation Model Adaptation for Pathological Image Classification.

  • Foundation model adaptation to medical image analysis
  • Data-efficient and low-cost visual prompt tuning
  • Injection of medical in-domain knowledge via text
  • Compatibility with various foundation models

Details

The recent surge of foundation models in computer vision and natural language processing opens up new opportunities for utilizing multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Guiding data-efficient image diagnosis with biomedical text knowledge is therefore of substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, adapting foundation models towards pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines, especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification.

(Figure: an overview of CITE.)

Dataset

The PatchGastric dataset comprises histopathological image patches extracted from H&E-stained whole slide images (WSIs) of stomach adenocarcinoma endoscopic biopsy specimens. The dataset covers 9 subtypes of gastric adenocarcinoma. We choose the 3 major subtypes, "well differentiated tubular adenocarcinoma", "moderately differentiated tubular adenocarcinoma", and "poorly differentiated adenocarcinoma", to form a 3-class grading-like classification task with 179,285 patches of size 300×300 extracted from 693 WSIs.

To prepare the PatchGastric dataset:

  1. Download captions.csv and patches_captions.zip from PatchGastricADC22.
  2. Put them in data/ and unzip the file.
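
To confirm the layout, a quick check such as the following can help (a minimal sketch; the extracted folder name data/patches_captions/ and the .jpg extension are assumptions based on the zip file name):

from pathlib import Path

# Check that the dataset files described above are in place.
data_dir = Path('data')
assert (data_dir / 'captions.csv').is_file(), 'captions.csv not found in data/'
patches_dir = data_dir / 'patches_captions'  # assumed name of the unzipped folder
assert patches_dir.is_dir(), 'unzip patches_captions.zip into data/ first'
print(f'Found {sum(1 for _ in patches_dir.rglob("*.jpg")):,} patch images')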

Get Started

Main Requirements

torch==1.13.0
mmcls==0.25.0
transformers
clip

Installation

conda create -n CITE python=3.9
conda activate CITE
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install openmim
mim install mmcls==0.25.0
pip install -r requirements.txt
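
An optional sanity check that the pinned versions resolved correctly:

import torch
import mmcls

print(torch.__version__)          # expected: 1.13.0
print(mmcls.__version__)          # expected: 0.25.0
print(torch.cuda.is_available())  # True if the CUDA 11.7 build detects a GPU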

Preprocess

To follow our split of the dataset, please generate the annotation files by running:

python tools/ann.py

Alternatively, you can generate your own split following the mmcls annotation format:

filename label
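
Each line pairs an image file name with an integer class label, separated by a space. For example (file names below are hypothetical; labels 0, 1, and 2 would denote the three subtypes):

patch_0001.jpg 0
patch_0002.jpg 1
patch_0003.jpg 2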

Training

The config files follow mmcls style.

PYTHONPATH=.:$PYTHONPATH mim train mmcls <config>
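
For example (the config and work-directory paths below are placeholders; substitute your own):

PYTHONPATH=.:$PYTHONPATH mim train mmcls configs/cite_example.py --work-dir work_dirs/cite_example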

Testing

PYTHONPATH=.:$PYTHONPATH mim test mmcls <config> --checkpoint <checkpoint> --metrics <metrics>
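
For example (paths are again placeholders; accuracy is one of the metrics mmcls supports for single-label classification):

PYTHONPATH=.:$PYTHONPATH mim test mmcls configs/cite_example.py --checkpoint work_dirs/cite_example/latest.pth --metrics accuracy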

🙋‍♀️ Feedback and Contact

📝 Citation

@inproceedings{zhang2023text,
  title={Text-guided Foundation Model Adaptation for Pathological Image Classification},
  author={Zhang, Yunkun and Gao, Jin and Zhou, Mu and Wang, Xiaosong and Qiao, Yu and Zhang, Shaoting and Wang, Dequan},
  booktitle={MICCAI},
  year={2023}
}

🗃️ Materials

We provide a comprehensive overview of current open-source medical language models, vision foundation models, and vision-language models, illustrating their applicability to our approach (CITE). For BERT-based language models, you can run CITE with your preferred Huggingface🤗 model by replacing model->head->text_encoder->model in the config file (and adjusting model->neck->out_features to match the encoder's hidden size).
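
As a sketch of such a swap (the surrounding config keys are inferred from the paths above; PubMedBERT is shown as one possible BERT-based encoder, and out_features must match its hidden size):

model = dict(
    head=dict(
        text_encoder=dict(
            # any BERT-style Huggingface model name can go here
            model='microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract',
        ),
    ),
    neck=dict(
        out_features=768,  # hidden size of the chosen text encoder
    ),
)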

Medical Language Models

| Model | Subfield | Paper | Code | Base |
| --- | --- | --- | --- | --- |
| Meditron | Medicine | Meditron-70B: Scaling Medical Pretraining for Large Language Models | Github | LLaMA 2 |
| RadFM | Radiology | Towards Generalist Foundation Model for Radiology | Github | LLaMA |
| BioMedGPT | Biomedicine | BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine | Github | LLaMA 2 |
| Med-PaLM 2 | Clinic | Towards Expert-Level Medical Question Answering with Large Language Models | Google | PaLM 2 |
| PMC-LLaMA | Medicine | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | Github | LLaMA |
| BenTsao (HuaTuo) | Biomedicine | HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge | Github | LLaMA |
| ChatDoctor | Medicine | ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge | Github | LLaMA |
| Clinical-T5 | Clinic | Clinical-T5: Large Language Models Built Using MIMIC Clinical Text | PhysioNet | T5 |
| Med-PaLM | Clinic | Large Language Models Encode Clinical Knowledge | Google | PaLM |
| BioGPT | Biomedicine | BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining | Github | GPT-2 |
| BioLinkBERT | Biomedicine | LinkBERT: Pretraining Language Models with Document Links | Github | BERT |
| PubMedBERT | Biomedicine | Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing | Microsoft | BERT |
| BioBERT | Biomedicine | BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining | Github | BERT |
| BlueBERT | Biomedicine | An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining | Github | BERT |
| Clinical BERT | Clinic | Publicly Available Clinical BERT Embeddings | Github | BERT |
| SciBERT | Biomedicine | SciBERT: A Pretrained Language Model for Scientific Text | Github | BERT |

Vision Models

| Model | Subfield | Paper | Code | Base |
| --- | --- | --- | --- | --- |
| REMEDIS | Radiology | Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging | Github | SimCLR |
| RETFound | Retinopathy | A Foundation Model for Generalizable Disease Detection from Retinal Images | Github | MAE |
| CTransPath | Pathology | Transformer-Based Unsupervised Contrastive Learning for Histopathological Image Classification | Github | - |
| HIPT | Pathology | Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning | Github | DINO |
| INTERN-2.5 | General | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | Github | - |
| DINOv2 | General | DINOv2: Learning Robust Visual Features without Supervision | Github | - |
| MAE | General | Masked Autoencoders Are Scalable Vision Learners | Github | - |
| ViT (ImageNet) | General | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | Huggingface | - |

Vision-Language Models

| Model | Subfield | Paper | Code | Base |
| --- | --- | --- | --- | --- |
| Qilin-Med-VL | Radiology | Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare | Github | LLaVA |
| RadFM | Radiology | Towards Generalist Foundation Model for Radiology | Github | - |
| KAD | Radiology | Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images | Github | CLIP |
| Med-Flamingo | Medicine | Med-Flamingo: A Multimodal Medical Few-Shot Learner | Github | Flamingo |
| QuiltNet | Pathology | Quilt-1M: One Million Image-Text Pairs for Histopathology | Github | CLIP |
| PLIP | Pathology | A Visual-Language Foundation Model for Pathology Image Analysis Using Medical Twitter | Huggingface | CLIP |
| MI-Zero | Pathology | Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images | Github | CLIP |
| LLaVA-Med | Biomedicine | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Github | LLaVA |
| MedVInT | Biomedicine | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Github | - |
| PMC-CLIP | Biomedicine | PMC-CLIP: Contrastive Language-Image Pre-Training Using Biomedical Documents | Github | CLIP |
| BiomedCLIP | Biomedicine | Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing | Huggingface | CLIP |
| MedCLIP | Medicine | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Github | CLIP |
| CheXzero | Radiology | Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning | Github | CLIP |
| PubMedCLIP | Radiology | Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? | Github | CLIP |
| LLaVA | General | Visual Instruction Tuning | Github | - |
| Flamingo | General | Flamingo: a Visual Language Model for Few-Shot Learning | OpenFlamingo | - |
| CLIP | General | Learning Transferable Visual Models From Natural Language Supervision | Github | - |