Foundation Model for Endoscopy Video Analysis

This repository provides the official PyTorch implementation of the paper Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train by Zhao Wang*, Chang Liu*, Shaoting Zhang†, and Qi Dou†.

Key Features

First foundation model for endoscopy video analysis.
A large-scale endoscopic video dataset with over 33K video clips.
Support 3 types of downstream tasks, including classification, segmentation, and detection.

Details

Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downtream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art self-supervised pre-training and adapter-based transfer learning methods by a significant margin.

Datasets

We utilize 6 public and 1 private datasets for pre-training and 3 datasets as the downstream tasks. Except for SUN & SUN-SEG, we provide our preprocessed data for pre-training and downstream tasks.

Pre-training Data (6 public + 1 private)

Colonoscopic [original paper] [original dataset] [our preprocessed dataset]
SUN & SUN-SEG [original paper1] [original paper2] [original dataset1] [original dataset2]
LPPolypVideo [original paper] [original dataset] [our preprocessed dataset]
Hyper-Kvasir [original paper] [original dataset] [our preprocessed dataset]
Kvasir-Capsule [original paper] [original dataset] [our preprocessed dataset]
CholecTriplet [original paper] [original dataset] [our preprocessed dataset]
Our Private [our preprocessed dataset]

Downstream Data (3 public)

PolypDiag [original paper] [original dataset] [our preprocessed dataset]
CVC-12k [original paper] [original dataset] [our preprocessed dataset]
KUMC [original paper] [original dataset] [our preprocessed dataset]

For SUN & SUN-SEG, you need first request the original videos following this instruction. Then, you can transfer the data for pre-training videos by the following:

cd Endo-FM/data
python sun.py
python sun_seg.py
python trans_videos_pretrain.py

Finally, generating the video list pretrain/train.csv for pre-training by the following:

cd Endo-FM/data
python gencsv.py

Get Started

Main Requirements

torch==1.8.0
torchvision==0.9.0
pillow==6.2.2
timm==0.4.12

Installation

We suggest using Anaconda to setup environment on Linux, if you have installed anaconda, you can skip this step.

wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh && zsh Anaconda3-2020.11-Linux-x86_64.sh

Then, we can install packages using provided environment.yaml.

cd Endo-FM
conda env create -f environment.yaml
conda activate endofm

Pre-trained Weights

You can directly download our pre-trained Endo-FM via this link and put it under checkpoints/ for downstream fine-tuning.

Downstream Fine-tuned Weights

Also, we provide the pre-trained weights of 3 downstream tasks for direct downstream testing.

Dataset	PolypDiag	CVC-12k	KUMC
Our Paper	90.7	73.9	84.1
Released Model	91.3	76.6	84.0
Weights	link	link	link

Pre-training

cd Endo-FM
wget -P checkpoints/ https://github.com/kahnchana/svt/releases/download/v1.0/kinetics400_vitb_ssl.pth
bash scripts/train_clips32k.sh

Downstream Fine-tuning

# PolypDiag (Classification)
cd Endo-FM
bash scripts/eval_finetune_polypdiag.sh

# CVC (Segmentation)
cd Endo-FM/TransUNet
python train.py

# KUMC (Detection)
cd Endo-FM/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    tools/train_net.py \
    --master_port=$((RANDOM + 10000)) \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    OUTPUT_DIR log_dir/kumc_finetune

Direct Downstream Testing

# PolypDiag (Classification)
cd Endo-FM
bash scripts/test_finetune_polypdiag.sh

# CVC (Segmentation)
cd Endo-FM/TransUNet
python train.py --test

# KUMC (Detection)
cd Endo-FM/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    tools/test_net.py \
    --master_port=$((RANDOM + 10000)) \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    MODEL.WEIGHT kumc.pth \
    OUTPUT_DIR log_dir/kumc_finetune

🙋‍♀️ Feedback and Contact

For further questions, pls feel free to contact Zhao Wang.

🛡️ License

This project is under the Apache License 2.0 license. See LICENSE for details.

🙏 Acknowledgement

Our code is based on DINO, TimeSformer, SVT, TransUNet, and STFT. Thanks them for releasing their codes.

📝 Citation

If you find this code useful, please cite in your research papers.

@inproceedings{
    wang2023foundation,
    title={Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train},
    author={Zhao Wang and Chang Liu and Shaoting Zhang and Qi Dou},
    booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
    pages={101--111},
    year={2023},
    organization={Springer}
}

XGGNet/Endo-FM