dewenzeng/positional_cl

Difficulty reproducing fine-tuning results

joshestein opened this issue · 11 comments

Hello! Thank you for the great paper and for providing your code as open source.

I have been having some difficulty reproducing the fine-tuning results from Table 1 in the paper. I'm working with the ACDC dataset.

Here are the steps I have followed:

  1. Run generate_acdc.py, yielding two output folders - one for labeled data, one for unlabeled.
  2. Download the provided pretrained ACDC PCL model from Google Drive (see the checkpoint sanity check after this list).
  3. Run CUDA_VISIBLE_DEVICES=0 python train_supervised.py --device cuda:0 --batch_size 10 --epochs 100 --data_dir <ACDC_dir/labeled_dir> --lr 5e-5 --min_lr 5e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_pcl_sample_2 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 2 --restart --pretrained_model <pretrained_model.pth>.
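As a quick sanity check for step 2, I also verified that the downloaded checkpoint loads at all. This is only a minimal sketch; whether the weights sit under a "model" key or at the top level of the .pth file is an assumption on my side:

```python
import torch

# Load the downloaded PCL checkpoint on CPU just to confirm it is readable.
# The "model" key is an assumption about the saved format; adjust to however
# the provided .pth file is actually structured.
ckpt = torch.load("pretrained_model.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(type(ckpt).__name__, "checkpoint with", len(state), "entries")
```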

I am trying to reproduce the table's first column, so I am setting sampling_k=2.

I am getting the following validation dice results (at the end of the 100 epochs):

| Fold | Validation dice |
|------|-----------------|
| 0 | 0.1878 |
| 1 | 0.2449 |
| 2 | 0.2729 |
| 3 | 0.1103 |
| 4 | 0.1079 |

Average validation dice = 0.185, far below the reported 0.671.

Running the same experiment but setting sampling_k=6 (i.e. the second column), I get an average validation dice of (0.6847 + 0.7170 + 0.5431 + 0.5345 + 0.640) / 5.0 = 0.624, below the reported 0.850.
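For clarity, the averages above are plain means over the per-fold results; for example, for the sampling_k=2 run:

```python
# Mean validation Dice over the five cross-validation folds (sampling_k=2).
fold_dice = [0.1878, 0.2449, 0.2729, 0.1103, 0.1079]
print(f"average validation dice = {sum(fold_dice) / len(fold_dice):.3f}")  # 0.185
```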

I would appreciate any help or guidance! Please let me know if I am doing something wrong, or if I am interpreting my results incorrectly.

My suggestion is to download the data provided in the issues instead of processing the data yourself. I am not sure why that makes a difference, but it worked for me.

Thank you for your suggestion.

Could you elaborate on how the provided dataset was generated?

I am wondering why there could be differences between the provided generated dataset and the dataset generated by the generation script. I am also wondering why some patients are missing (e.g. patients 1, 2, 5, 7, ...) from the provided generated set?
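To narrow this down, I would diff the two datasets with something like the sketch below; the patientXXX file naming and folder layout are assumptions on my side, so the glob pattern may need adjusting to whatever generate_acdc.py actually emits:

```python
from pathlib import Path

def patient_ids(folder):
    # Collect the patient prefix (e.g. "patient001") from each file name.
    return {p.name.split("_")[0] for p in Path(folder).glob("patient*")}

local = patient_ids("<locally_generated/labeled_dir>")
provided = patient_ids("<provided/labeled_dir>")
print("only in local:   ", sorted(local - provided))
print("only in provided:", sorted(provided - local))
```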

I repeated the experiment with a slight modification to the code to skip missing patients.
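Roughly along these lines; this is only a sketch, since the real change lives in the dataset-loading code and the file naming is an assumption:

```python
from pathlib import Path

data_dir = Path("<provided/labeled_dir>")  # placeholder path
# Keep only patient IDs whose files are actually present instead of assuming
# all of ACDC's 100 training patients (patient001..patient100) exist.
available = {p.name.split("_")[0] for p in data_dir.glob("patient*")}
patients = [f"patient{i:03d}" for i in range(1, 101) if f"patient{i:03d}" in available]
```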

I am getting the following results (using 2 cross-validation folds):

sampling_k=2: validation_dice = (0.2056 + 0.1719) / 2 = 0.18875
sampling_k=6: validation_dice = (0.6583 + 0.6256) / 2 = 0.64195

I would appreciate any guidance in running the experiments - perhaps I am doing something incorrectly?

@joshestein Could you try setting the learning rate to 1e-4? lr = 5e-5 might be too small.

Using a learning rate of 1e-4, I get the following results:

sampling_k=2: (0.3187 + 0.3361) / 2 = 0.3274
sampling_k=6: (0.8384 + 0.8181) / 2 = 0.8283

Although the second result is now closer to the published result of 0.850, the first result is still quite far from the published 0.671.

Do you have any further ideas as to what could be the problem?

@joshestein You can try further increasing the learning rate (e.g., 5e-4, 1e-3) for really small datasets, because training on them is relatively unstable. The optimal hyperparameter setting can differ across datasets and dataset sizes. It is fine to vary the setting a little for a better Dice score, as long as the setting is fixed once a reasonable one is found, so that the comparison stays fair.
Accuracy variance also comes from different random seeds and machines. You can check my recent results at your setting here: https://tensorboard.dev/experiment/VHVOETpMQACQPbnOKJXeeg/#scalars
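For anyone else landing here, a small sweep over the suggested learning rates could look like the sketch below. It reuses the flags from the train_supervised.py commands quoted in this thread; the data and checkpoint paths are placeholders:

```python
import subprocess

# Sweep the learning rates suggested above for the 2-sample setting.
for lr in ["1e-4", "5e-4", "1e-3"]:
    subprocess.run([
        "python", "train_supervised.py",
        "--device", "cuda:0", "--batch_size", "10", "--epochs", "100",
        "--data_dir", "<ACDC_dir/labeled_dir>",
        "--lr", lr, "--min_lr", "1e-6",
        "--dataset", "acdc", "--patch_size", "352", "352",
        "--experiment_name", f"supervised_acdc_pcl_sample_2_lr_{lr}",
        "--initial_filter_size", "48", "--classes", "4",
        "--enable_few_data", "--sampling_k", "2", "--restart",
        "--pretrained_model_path", "<pretrained_model.pth>",
    ], check=True)
```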

Is it possible for you to share the hyperparameters you used for the linked runs?

This is the command I used for training with 6 samples:

python train_supervised.py --device cuda:0 --batch_size 10 --epochs 100 --data_dir /afs/crc.nd.edu/user/d/dzeng2/data/acdc/acdc_contrastive/supervised/2d --lr 5e-4 --min_lr 1e-6 --dataset acdc --patch_size 352 352 \
--experiment_name supervised_acdc_pcl_sample_6_ --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 6 --restart --pretrained_model_path ./results/pretrained_acdc_pcl/model.pth

Could you please share the hyperparameters for training with 2 samples?

Just change the learning rate to 1e-3.

Thank you 😄 Sorry, I somehow missed the learning rates in the linked tensorboard graphs.

Confirmed that training with two samples at LR 1e-3 and six samples at LR 5e-4 gives the expected results.
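For future readers, the settings that gave the expected results, assembled from the commands above (paths are placeholders):

2 samples: python train_supervised.py --device cuda:0 --batch_size 10 --epochs 100 --data_dir <ACDC_dir/labeled_dir> --lr 1e-3 --min_lr 1e-6 --dataset acdc --patch_size 352 352 --experiment_name supervised_acdc_pcl_sample_2 --initial_filter_size 48 --classes 4 --enable_few_data --sampling_k 2 --restart --pretrained_model_path <pretrained_model.pth>

6 samples: the same command with --lr 5e-4 and --sampling_k 6.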