SAT-DS

This is the official repository for building SAT-DS, a medical data collection of 72 public segmentation datasets that contains over 22K 3D images, 302K segmentation masks and 497 classes from 3 modalities (MRI, CT, PET) and 8 human body regions. 🚀

Based on this data collection, we built a universal segmentation model for 3D radiology scans driven by text prompts (check this repo and our paper).

The data collection will keep growing, so stay tuned!

Highlight

🎉 To save you the time of downloading and preprocessing so many datasets, we offer shortcut download links for 42 of the 72 datasets in SAT-DS, namely those whose licenses (such as CC BY-SA) allow re-attribution. Find them in dropbox.

All these datasets are preprocessed and packaged by us for your convenience, ready for immediate use upon download and extraction. Download the datasets you need and unzip them in data/nii; they can then be used immediately with the paired jsonl files in data/jsonl. See Step 3 below for how to use them. Note that we respect and adhere to the licenses of all the datasets. If we have incorrectly re-attributed any of them, please contact us.

What we have done in building SAT-DS:

  • Collect as many public datasets as possible for 3D medical segmentation, and compile their basic information;
  • Check and normalize image scans in each dataset, including orientation, spacing and intensity;
  • Check, standardize, and merge the label names for categories in each dataset;
  • Carefully split each dataset into train and test sets by patient ID.

What we offer in this repo:

  • (Step 1) Access to each dataset in SAT-DS.
  • (Step 2) Code to preprocess samples in each dataset.
  • (Shortcut to skip Step 1 and 2) Access to preprocessed and packaged datasets that can be used immediately.
  • (Step 3) Code to load samples with normalized images and standardized class names from each dataset.
  • (Step 3) Code to visualize and check the samples.
  • (Step 4) Code to prepare the train and evaluation data for SAT in the required format.
  • (Step 5) Code to split each dataset into train and test sets consistent with SAT.

This repo can be used to:

  • (Follow Steps 1-3) Preprocess and unify a large-scale, comprehensive 3D medical segmentation data collection, suitable for training or fine-tuning universal segmentation models like SAM2.
  • (Follow Steps 1-6) Prepare the training and test data in the format required by SAT.

Check our paper "One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts" for more details.


Step 1: Download datasets

This is the detailed list of all the datasets and their official download links. Their citation information can be found in citation.bib.

As a shortcut, we preprocess, package and re-attribute some of them for your convenience. Download them here.

| Dataset Name | Modality | Region | Classes | Scans | Download link |
| --- | --- | --- | --- | --- | --- |
| AbdomenCT1K | CT | Abdomen | 4 | 988 | https://github.com/JunMa11/AbdomenCT-1K |
| ACDC | MRI | Thorax | 4 | 300 | https://humanheart-project.creatis.insa-lyon.fr/database/ |
| AMOS CT | CT | Abdomen | 16 | 300 | https://zenodo.org/records/7262581 |
| AMOS MRI | MRI | Abdomen | 16 | 60 | https://zenodo.org/records/7262581 |
| ATLASR2 | MRI | Brain | 1 | 654 | http://fcon_1000.projects.nitrc.org/indi/retro/atlas.html |
| ATLAS | MRI | Abdomen | 2 | 60 | https://atlas-challenge.u-bourgogne.fr |
| autoPET | PET | Whole Body | 1 | 501 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=93258287 |
| Brain Atlas | MRI | Brain | 108 | 30 | http://brain-development.org/ |
| BrainPTM | MRI | Brain | 7 | 60 | https://brainptm-2021.grand-challenge.org/ |
| BraTS2023 GLI | MRI | Brain | 4 | 5004 | https://www.synapse.org/#!Synapse:syn51514105 |
| BraTS2023 MEN | MRI | Brain | 4 | 4000 | https://www.synapse.org/#!Synapse:syn51514106 |
| BraTS2023 MET | MRI | Brain | 4 | 951 | https://www.synapse.org/#!Synapse:syn51514107 |
| BraTS2023 PED | MRI | Brain | 4 | 396 | https://www.synapse.org/#!Synapse:syn51514108 |
| BraTS2023 SSA | MRI | Brain | 4 | 240 | https://www.synapse.org/#!Synapse:syn51514109 |
| BTCV Abdomen | CT | Abdomen | 15 | 30 | https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 |
| BTCV Cervix | CT | Abdomen | 4 | 30 | https://www.synapse.org/Synapse:syn3378972 |
| CHAOS CT | CT | Abdomen | 1 | 20 | https://chaos.grand-challenge.org/ |
| CHAOS MRI | MRI | Abdomen | 5 | 60 | https://chaos.grand-challenge.org/ |
| CMRxMotion | MRI | Thorax | 4 | 138 | https://www.synapse.org/#!Synapse:syn28503327/files/ |
| Couinaud | CT | Abdomen | 10 | 161 | https://github.com/GLCUnet/dataset |
| COVID-19 CT Seg | CT | Thorax | 4 | 20 | https://github.com/JunMa11/COVID-19-CT-Seg-Benchmark |
| CrossMoDA2021 | MRI | Head and Neck | 2 | 105 | https://crossmoda.grand-challenge.org/Data/ |
| CT-ORG | CT | Whole Body | 6 | 140 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=61080890 |
| CTPelvic1K | CT | Lower Limb | 5 | 117 | https://zenodo.org/record/4588403#.YEyLq_0zaCo |
| DAP Atlas | CT | Whole Body | 179 | 533 | https://github.com/alexanderjaus/AtlasDataset |
| FeTA2022 | MRI | Brain | 7 | 80 | https://feta.grand-challenge.org/data-download/ |
| FLARE22 | CT | Abdomen | 15 | 50 | https://flare22.grand-challenge.org/ |
| FUMPE | CT | Thorax | 1 | 35 | https://www.kaggle.com/datasets/andrewmvd/pulmonary-embolism-in-ct-images |
| HAN Seg | CT | Head and Neck | 41 | 41 | https://zenodo.org/record/ |
| HECKTOR2022 | PET | Head and Neck | 2 | 524 | https://hecktor.grand-challenge.org/Data/ |
| INSTANCE | CT | Brain | 1 | 100 | https://instance.grand-challenge.org/Dataset/ |
| ISLES2022 | MRI | Brain | 1 | 500 | http://www.isles-challenge.org/ |
| KiPA22 | CT | Abdomen | 4 | 70 | https://kipa22.grand-challenge.org/dataset/ |
| KiTS23 | CT | Abdomen | 3 | 489 | https://github.com/neheller/kits23 |
| LAScarQS2022 Task 1 | MRI | Thorax | 2 | 60 | https://zmiclab.github.io/projects/lascarqs22/data.html |
| LAScarQS2022 Task 2 | MRI | Thorax | 1 | 130 | https://zmiclab.github.io/projects/lascarqs22/data.html |
| LNDb | CT | Thorax | 1 | 236 | https://zenodo.org/record/7153205#.Yz_oVHbMJPZ |
| LUNA16 | CT | Thorax | 1 | 888 | https://luna16.grand-challenge.org/ |
| MM-WHS CT | CT | Thorax | 9 | 40 | https://mega.nz/folder/UNMF2YYI#1cqJVzo4p_wESv9P_pc8uA |
| MM-WHS MR | MRI | Thorax | 9 | 40 | https://mega.nz/folder/UNMF2YYI#1cqJVzo4p_wESv9P_pc8uA |
| MRSpineSeg | MRI | Spine | 23 | 91 | https://www.cg.informatik.uni-siegen.de/en/spine-segmentation-and-analysis |
| MSD Cardiac | MRI | Thorax | 1 | 20 | http://medicaldecathlon.com/ |
| MSD Colon | CT | Abdomen | 1 | 126 | http://medicaldecathlon.com/ |
| MSD HepaticVessel | CT | Abdomen | 2 | 303 | http://medicaldecathlon.com/ |
| MSD Hippocampus | MRI | Brain | 3 | 260 | http://medicaldecathlon.com/ |
| MSD Liver | CT | Abdomen | 2 | 131 | http://medicaldecathlon.com/ |
| MSD Lung | CT | Thorax | 1 | 63 | http://medicaldecathlon.com/ |
| MSD Pancreas | CT | Abdomen | 2 | 281 | http://medicaldecathlon.com/ |
| MSD Prostate | MRI | Pelvis | 2 | 64 | http://medicaldecathlon.com/ |
| MSD Spleen | CT | Abdomen | 1 | 41 | http://medicaldecathlon.com/ |
| MyoPS2020 | MRI | Thorax | 6 | 135 | https://mega.nz/folder/BRdnDISQ#FnCg9ykPlTWYe5hrRZxi-w |
| NSCLC | CT | Thorax | 2 | 85 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=68551327 |
| Pancreas CT | CT | Abdomen | 1 | 80 | https://wiki.cancerimagingarchive.net/display/public/pancreas-ct |
| Parse2022 | CT | Thorax | 1 | 100 | https://parse2022.grand-challenge.org/Dataset/ |
| PDDCA | CT | Head and Neck | 12 | 48 | https://www.imagenglab.com/newsite/pddca/ |
| PROMISE12 | MRI | Pelvis | 1 | 50 | https://promise12.grand-challenge.org/Details/ |
| SEGA | CT | Whole Body | 1 | 56 | https://multicenteraorta.grand-challenge.org/data/ |
| SegRap2023 Task1 | CT | Head and Neck | 61 | 120 | https://segrap2023.grand-challenge.org/ |
| SegRap2023 Task2 | CT | Thorax | 2 | 120 | https://segrap2023.grand-challenge.org/ |
| SegTHOR | CT | Thorax | 4 | 40 | https://competitions.codalab.org/competitions/21145#learn_the_details |
| SKI10 | MRI | Lower Limb | 4 | 99 | https://ambellan.de/sharing/QjrntLwah |
| SLIVER07 | CT | Abdomen | 1 | 20 | https://sliver07.grand-challenge.org/ |
| ToothFairy | CT | Head and Neck | 4 | 153 | https://ditto.ing.unimore.it/toothfairy/ |
| TotalSegmentator Cardiac | CT | Whole Body | 17 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Muscles | CT | Whole Body | 31 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Organs | CT | Whole Body | 24 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Ribs | CT | Whole Body | 39 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Vertebrae | CT | Whole Body | 29 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator V2 | CT | Whole Body | 24 | 1202 | https://zenodo.org/record/6802614 |
| VerSe | CT | Whole Body | 29 | 96 | https://github.com/anjany/verse |
| WMH | MRI | Brain | 1 | 170 | https://wmh.isi.uu.nl/ |
| WORD | CT | Abdomen | 18 | 150 | https://github.com/HiLab-git/WORD |

Step 2: Preprocess datasets

For each dataset, we need to find all the image-mask pairs, plus five pieces of basic information: dataset name, modality, label names, patient IDs (used to split the train and test sets) and the official split (if provided).
In processor.py, we customize the processing procedure for each dataset to generate a jsonl file that records this information for each sample.
Taking AbdomenCT1K as an example, run the following command:

python processor.py \
--dataset_name AbdomenCT1K \
--root_path 'SAT-DS/data/nii/AbdomenCT-1K' \
--jsonl_dir 'SAT-DS/data/jsonl'

root_path should be where you downloaded and placed the data; jsonl_dir should be where you plan to place the jsonl files.
⚠️ Note that the dataset_name and the name in the table might not be exactly the same. For details, refer to each process function in processor.py.
After processing, each sample in the jsonl file looks like this:

{
  "image": "SAT-DS/data/nii/AbdomenCT-1K/Images/Case_00558_0000.nii.gz",
  "mask": "SAT-DS/data/nii/AbdomenCT-1K/Masks/Case_00558.nii.gz",
  "label": ["liver", "kidney", "spleen", "pancreas"],
  "modality": "CT",
  "dataset": "AbdomenCT1K",
  "official_split": "unknown",
  "patient_id": "Case_00558_0000.nii.gz"
}

Note that in this step we may convert the images and masks into new NIfTI files for some datasets, such as TotalSegmentator, so it may take some time.
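Before moving on, it can help to sanity-check a generated jsonl file. The following is a minimal sketch and not part of this repository: the helper name parse_jsonl_lines is hypothetical, and treating every field from the example record above as required is an assumption.

```python
import json

# Fields shown in the example record above; treating all of them as
# required is an assumption of this sketch.
REQUIRED_KEYS = {"image", "mask", "label", "modality", "dataset",
                 "official_split", "patient_id"}


def parse_jsonl_lines(lines):
    """Parse jsonl lines and return (records, problems).

    `problems` holds (line_number, message) tuples for lines that are not
    valid JSON objects or that lack one of REQUIRED_KEYS.
    """
    records, problems = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # skip blank lines
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
            continue
        records.append(record)
    return records, problems


# Demo on in-memory lines: one good record, one blank line, two bad ones.
demo_lines = [
    json.dumps({key: "placeholder" for key in REQUIRED_KEYS}),
    "",
    "{not valid json",
    json.dumps({"image": "a.nii.gz"}),
]
records, problems = parse_jsonl_lines(demo_lines)
for number, message in problems:
    print(f"line {number}: {message}")
```

To check a real file, pass an open file handle (for example one opened on SAT-DS/data/jsonl/AbdomenCT1K.jsonl) instead of demo_lines.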

Shortcut to skip Step 1 and 2: Download the preprocessed and packaged data for immediate use

We offer shortcut download links for 42 datasets in dropbox. All of these datasets are preprocessed and packaged in advance. Download the datasets you need and unzip them in data/nii; each dataset is paired with a jsonl file in data/jsonl.
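After unzipping, a quick check that every path in a jsonl file resolves on disk can save debugging time later. This is a sketch under assumptions, not code from the repository: find_missing_files is a hypothetical helper, and it assumes the image and mask paths are either absolute or relative to the directory passed as root.

```python
import os
import tempfile


def find_missing_files(records, root="."):
    """Return the image/mask paths in `records` that do not exist under `root`."""
    missing = []
    for record in records:
        for key in ("image", "mask"):
            # os.path.join keeps absolute paths unchanged and resolves
            # relative ones against `root`.
            path = os.path.join(root, record[key])
            if not os.path.exists(path):
                missing.append(path)
    return missing


# Demo on a throwaway directory: only the image file is created.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "case_0.nii.gz"), "w").close()
    demo = [{"image": "case_0.nii.gz", "mask": "case_0_mask.nii.gz"}]
    missing = find_missing_files(demo, root=root)
    missing_names = [os.path.basename(p) for p in missing]
    print(missing_names)
```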

Step 3: Load data with unified normalization

With the generated jsonl file, a dataset is now ready to be used.
However, when mixing all the datasets to train a universal segmentation model, we need to normalize image intensity, orientation and spacing across all the datasets, and adjust labels where necessary.
We do this by customizing the load script for each dataset in loader.py. Here is a simple demo of how to use it in your code:

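import json

from loader import Loader_Wrapper

loader = Loader_Wrapper()

# load samples from a jsonl file (one JSON record per line)
with open('SAT-DS/data/jsonl/AbdomenCT1K.jsonl', 'r') as f:
    data = [json.loads(line) for line in f]

# load each sample with the loader method for its dataset
for sample in data:
    # assumption: the method name matches the sample's 'dataset' field; check loader.py
    func_name = sample['dataset']
    batch = getattr(loader, func_name)(sample)
    img_tensor, mc_mask, text_ls, modality, image_path, mask_path = batch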

For each sample, whichever dataset it comes from, the loader returns output in a normalized format:

img_tensor  # tensor with shape (1, H, W, D)
mc_mask  # binary tensor with shape (N, H, W, D), one channel for each class
text_ls  # a list of N class names
modality  # MRI, CT or PET
image_path  # path to the loaded image file
mask_path  # path to the loaded mask file
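To make this format concrete, here is a small sketch that builds a multi-channel binary mask from an integer label map. It is illustrative only: it uses NumPy arrays rather than the tensors the loader returns, and the helper to_multichannel, the toy volume and the class names are all assumptions.

```python
import numpy as np


def to_multichannel(label_map, class_ids):
    """Turn an integer label map (H, W, D) into a binary mask (N, H, W, D)."""
    return np.stack([label_map == cid for cid in class_ids]).astype(np.uint8)


# Toy volume: 0 is background, 1 and 2 are two foreground classes.
label_map = np.zeros((4, 4, 2), dtype=np.int64)
label_map[0, 0, 0] = 1
label_map[1:3, 1:3, :] = 2

mc_mask = to_multichannel(label_map, class_ids=[1, 2])
text_ls = ["liver", "kidney"]            # one name per channel, same order
img_tensor = np.random.rand(1, 4, 4, 2)  # stand-in image with a leading channel axis

print(img_tensor.shape, mc_mask.shape, text_ls)
```

The invariant worth checking in your own code is that the number of mask channels equals the length of text_ls, and that the spatial dimensions of the image and the mask agree.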

⚠️ Note that we may merge and adjust labels here in the loader. Therefore, the output text_ls may differ from the label you see in the input jsonl file. Here is a case where we merge "left kidney" and "right kidney" into a new label "kidney" when loading examples from CHAOS_MRI:

kidney = mask[1] + mask[2]
mask = torch.cat((mask, kidney.unsqueeze(0)), dim=0)
labels.append("kidney")

And here is another case where we adjust the annotation of kidney by integrating the annotation of kidney tumor and kidney cyst:

mc_masks[0] += mc_masks[1]
mc_masks[0] += mc_masks[2]
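Both adjustments can be reproduced on toy data. The sketch below uses NumPy instead of torch, and the channel order, label names and the union helper are assumptions made for illustration. It also uses a voxel-wise union rather than plain addition, which is a swapped-in variant: addition gives the same result when the channels never overlap, but produces values above 1 where they do.

```python
import numpy as np


def union(*channels):
    """Voxel-wise union of binary masks; the result stays in {0, 1}."""
    out = np.zeros_like(channels[0], dtype=bool)
    for ch in channels:
        out |= ch.astype(bool)
    return out.astype(np.uint8)


# Case 1: append a merged "kidney" channel built from two existing channels.
labels = ["liver", "right kidney", "left kidney"]  # hypothetical order
mask = np.zeros((3, 2, 2, 1), dtype=np.uint8)
mask[1, 0, 0, 0] = 1  # right kidney voxel
mask[2, 1, 1, 0] = 1  # left kidney voxel
kidney = union(mask[1], mask[2])
mask = np.concatenate([mask, kidney[None]], axis=0)
labels.append("kidney")

# Case 2: fold lesion channels into the organ channel in place.
mc_masks = np.zeros((3, 2, 2, 1), dtype=np.uint8)  # kidney, tumor, cyst
mc_masks[0, 0, 0, 0] = 1
mc_masks[1, 0, 0, 0] = 1  # overlaps the kidney voxel
mc_masks[2, 0, 1, 0] = 1
mc_masks[0] = union(mc_masks[0], mc_masks[1], mc_masks[2])
```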

We also offer a shortcut to visualize and check any sample in any dataset after normalization. For example, to visualize the first sample in AbdomenCT1K.jsonl, just run the following command:

python loader.py \
--visualization_dir 'SAT-DS/data/visualization' \
--path2jsonl 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
--i 0

(Optional) Step 4: Convert to npy files

For convenience, before training SAT, we normalize all the data according to Step 3 and convert the images and segmentation masks to npy files. If you plan to use our training code, run this command for each dataset:

python convert_to_npy.py \
--jsonl2load 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
--jsonl2save 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl'

The converted npy files will be saved in preprocessed_npy/dataset_name, and some new information will be added to the jsonl file to make loading the npy files convenient.
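One practical benefit of npy files is that they can be memory-mapped, so a training loop can read a patch without loading the whole volume. The sketch below only illustrates that idea with NumPy; the file names and array shapes are made up and do not reflect what convert_to_npy.py actually writes.

```python
import os
import tempfile

import numpy as np

volume = np.random.rand(1, 8, 8, 4).astype(np.float32)  # stand-in normalized image
mask = np.random.rand(2, 8, 8, 4) > 0.5                 # stand-in binary masks

with tempfile.TemporaryDirectory() as out_dir:
    img_path = os.path.join(out_dir, "img.npy")
    seg_path = os.path.join(out_dir, "seg.npy")
    np.save(img_path, volume)
    np.save(seg_path, mask)

    # mmap_mode="r" maps the file lazily, so slicing reads only part of it.
    img = np.load(img_path, mmap_mode="r")
    patch = np.array(img[:, :4, :4, :2])  # copy the patch into memory
    del img                               # release the mapping before cleanup

    restored_mask = np.load(seg_path)
```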

(Optional) Step 5: Split train and test set

We offer the train-test split used in our paper for each dataset in json files. To follow our split and benchmark your method, simply run this command:

python train_test_split.py \
--jsonl2split 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
--train_jsonl 'SAT-DS/data/trainset_jsonl/AbdomenCT1K.jsonl' \
--test_jsonl 'SAT-DS/data/testset_jsonl/AbdomenCT1K.jsonl' \
--split_json 'SAT-DS/data/split_json/AbdomenCT1K.json'

This will split the jsonl file into train and test.

Or, if you want to re-split a dataset, customize the split by listing the patient_id values in the json file (the patient_id of each sample can be found in the jsonl file of each dataset):

{"train": ["train_patient_id1", ...], "test": ["test_patient_id1", ...]}
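As an illustration of a patient-level split, the sketch below builds a split in the format shown above and applies it to a list of records. It is not the logic of train_test_split.py: the helper names, the 20% test fraction and the fixed seed are assumptions. Splitting over unique patient ids, rather than over samples, keeps all scans of one patient on the same side.

```python
import json
import random


def make_patient_split(records, test_fraction=0.2, seed=0):
    """Build {"train": [...], "test": [...]} over unique patient ids."""
    patient_ids = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    return {"train": patient_ids[n_test:], "test": patient_ids[:n_test]}


def apply_split(records, split):
    """Partition records into (train, test) by their patient_id."""
    train_ids, test_ids = set(split["train"]), set(split["test"])
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"patients in both train and test: {sorted(overlap)}")
    train = [r for r in records if r["patient_id"] in train_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


# Toy data: 5 patients with 2 scans each.
records = [{"patient_id": f"p{i // 2}", "image": f"scan_{i}.nii.gz"}
           for i in range(10)]
split = make_patient_split(records, test_fraction=0.2, seed=0)
train, test = apply_split(records, split)
print(json.dumps(split))
```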

(Optional) Step 6: DIY your data collection

You may want to customize the data collection used to train your model. To do so, simply merge the train jsonls of the datasets you want to include. For example, merge the jsonls of all 72 datasets into train.jsonl, and you can use them together to train SAT with our training code in this repo.

Similarly, you can build a custom benchmark from any datasets you want by merging their test jsonls.
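Merging jsonl files amounts to concatenating their lines. A minimal sketch follows, assuming one JSON record per line; merge_jsonl is a hypothetical helper, and the demo files stand in for real per-dataset train jsonls.

```python
import json
import os
import tempfile


def merge_jsonl(input_paths, output_path):
    """Concatenate jsonl files into one and return the number of records written."""
    count = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:  # skip blank lines
                        continue
                    json.loads(line)  # fail early on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count


# Demo with two small throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "A.jsonl")
    b = os.path.join(tmp, "B.jsonl")
    with open(a, "w") as f:
        f.write('{"dataset": "A", "patient_id": "a1"}\n'
                '{"dataset": "A", "patient_id": "a2"}\n')
    with open(b, "w") as f:
        f.write('{"dataset": "B", "patient_id": "b1"}\n\n')
    merged_path = os.path.join(tmp, "train.jsonl")
    n_written = merge_jsonl([a, b], merged_path)
    with open(merged_path) as f:
        merged = [json.loads(line) for line in f]
```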

Citation

If you use this code for your research or project, please cite:

@article{zhao2023model,
  title={One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts},
  author={Ziheng Zhao and Yao Zhang and Chaoyi Wu and Xiaoman Zhang and Ya Zhang and Yanfeng Wang and Weidi Xie},
  year={2023},
  journal={arXiv preprint arXiv:2312.17183}
}

And if you use any of the datasets in SAT-DS, please cite the corresponding papers. Summarized citation information can be found in citation.bib.