cvlab-stonybrook/PathLDM

Replicating Fine-tuning Results

Closed this issue · 27 comments

@srikarym I'm trying to replicate the results from your paper using this codebase. However, when running your code, I obtain the following loss plots in Weights & Biases:

[Screenshot: Weights & Biases training loss plot, 2024-05-14]

As you can see, the training loss barely changes over the course of a couple of training epochs.

I'm using the following datasets and checkpoints directly downloaded from this repo:

  1. Data
  2. Report Summaries
  3. center_crop_real_stats.npz
  4. First stage model weights
  5. Config files

I then run the code on 3 GPUs using the following command:
python -u main.py -t --gpus 0,1,2 --base config.yaml

Given that the model does not appear to be learning at all, could you please help fix this issue?

I believe this is an issue commonly faced in diffusion model training: the denoising loss is not entirely reflective of image generation quality. You could visually check whether the generated images are getting better, or measure the FID.
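For example, here is a minimal sampling sketch for a visual check, assuming the standard latent-diffusion API this codebase builds on; the config/checkpoint paths and the prompt format below are placeholders:

```python
# Minimal sampling sketch for a visual check, assuming the standard
# latent-diffusion API that PathLDM builds on. Paths and the prompt
# are placeholders; adjust to your config and checkpoint.
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
from ldm.models.diffusion.ddim import DDIMSampler

config = OmegaConf.load("config.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("last.ckpt", map_location="cpu")["state_dict"], strict=False)
model = model.cuda().eval()

sampler = DDIMSampler(model)
prompt = "low tumor; low til; <report summary here>"  # placeholder prompt format
with torch.no_grad():
    c = model.get_learned_conditioning(4 * [prompt])
    uc = model.get_learned_conditioning(4 * [""])
    # vq-f4 autoencoder: 256x256 images correspond to 3x64x64 latents
    samples, _ = sampler.sample(S=50, conditioning=c, batch_size=4,
                                shape=[3, 64, 64], verbose=False,
                                unconditional_guidance_scale=1.5,
                                unconditional_conditioning=uc, eta=0.0)
    imgs = model.decode_first_stage(samples)            # values in [-1, 1]
    imgs = torch.clamp((imgs + 1.0) / 2.0, 0.0, 1.0)    # rescale to [0, 1]
```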

Hello, have you encountered a situation where the FID remains at a large value during training, as I did in #23?

@srikarym I had a similar issue as @sjjadsa found in #23. Neither the loss nor the FID score changes throughout training despite using the data, checkpoints, and configurations you provided.

I believe #23 didn't use the entire training data. The user reports they've finished 531,899 epochs. That's physically impossible as each epoch took us around 2 days.

@srikarym That makes sense. Something doesn't seem right based on that number.

I still have the problem that my FID is around 150 despite training for 50 epochs.
This is using all of the provided checkpoints and data as mentioned in the original comment.
I also used 3 GPUs and seemingly the same batch size to avoid issues with the learning rate.
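For reference, the LDM-style main.py typically scales the base learning rate by the number of GPUs, the per-GPU batch size, and the gradient-accumulation steps, which is why matching the batch size and GPU count matters. A sketch of that calculation with placeholder values (check this repo's main.py and config for the exact expression):

```python
# Hedged sketch of how an LDM-style main.py derives the effective learning
# rate when scale_lr is enabled; all values below are placeholders.
ngpu = 3                      # --gpus 0,1,2
batch_size = 16               # data.params.batch_size (placeholder)
accumulate_grad_batches = 1   # Lightning gradient accumulation (placeholder)
base_lr = 5.0e-6              # model.base_learning_rate (placeholder)

effective_lr = accumulate_grad_batches * ngpu * batch_size * base_lr
print(f"model.learning_rate = {effective_lr:.2e}")
```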

How long did you train for to achieve the results in the paper?

How do the generated samples look?

@srikarym

[Attached: generated sample images]

The attached images are from the "samples_gs" outputs. They don't look great.
The overall structure isn't right and doesn't resemble a real histology image.
I think the main issue is that the FID metric doesn't allow for proper evaluation here.

@LoadinggniaoL What did you do to solve the problem of the FID score not changing?

@sjjadsa I wasn't able to fix that. Despite trying around 50 configurations, I was not able to replicate their results using the provided data, checkpoints, and even their configuration files.

The images shown above are from the best run where I obtained an FID of around 70. At this point I'm no longer using this codebase and will rewrite the code myself.

@LoadinggniaoL did you generate samples using the same text reports when computing FID? We randomly pick reports, generate 10k samples, and compute FID.

@srikarym I computed the FID in a similar way: I randomly selected around 5000 text reports, then passed the generated images through the pytorch-fid library. The images are in the expected value range, but FID aside, the outputs are clearly not histologically relevant. They display features atypical of histology images, so I tend to believe the high FID score is accurate.
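For reference, this kind of comparison with pytorch-fid looks roughly like the following, whether against a directory of real crops or the provided center_crop_real_stats.npz (paths below are placeholders):

```python
# Hedged sketch: computing FID with the pytorch-fid package, comparing a
# directory of generated samples against precomputed real-image statistics.
# Paths are placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["center_crop_real_stats.npz", "generated_samples/"],  # .npz stats or image dirs
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,  # standard InceptionV3 pool3 features
)
print(f"FID: {fid:.2f}")
```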

Can you provide the exact training setup, including the data, configuration files, and checkpoints, so that we can replicate these results? It's unclear from the prior comments exactly which setup was used. For instance, you've mentioned 12 million training images multiple times, yet the provided BRCA data only includes about 1.2 million images (I assume this is a simple mistake).

Where did you get these 5000 reports from? We used ~1000 WSI and report summaries from TCGA BRCA for training. Did you also prepend low / high tumor and TIL at the beginning of the summarized report?

The 5000 reports were created from a subset of the original ~1000 reports. I prepended various combinations of the "tumor/til" prefixes as well as other prompt text such as "histopathology whole-slide image with ...". Regardless of the report origin, the FID score remained the same, which indicates a more fundamental problem with the model.
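Roughly, the prompt construction looked something like the sketch below; the exact prefix wording and separators are illustrative, not necessarily what the repo expects:

```python
# Hedged sketch of building conditioning text by prepending tumor / TIL
# descriptors to a summarized report. The exact wording and separator the
# repo uses may differ.
import random

tumor_levels = ["low tumor", "high tumor"]
til_levels = ["low til", "high til"]

def build_prompt(report_summary: str) -> str:
    prefix = f"{random.choice(tumor_levels)}; {random.choice(til_levels)}"
    return f"{prefix}; {report_summary}"

print(build_prompt("invasive ductal carcinoma, grade 2, ..."))
```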

Also, in the paper you mention 3.2 million patches but then the repo provides 1.2 million. What are the expected results on the 1.2 million patch version?

As a follow-up, the paper is worded in a way that suggests you used the training set text for validation.
Is this the case?

The data provided contains 1.2 million patches at 448x448 resolution and 10x magnification. During training, we take 256x256 random crops, which makes the expected size of the dataset 3.2 million.
Did you take random crops during training? The FID statistics we provide are for 256x256 real image crops at 10x.
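A sketch of that crop pipeline (the exact transforms live in the repo's dataloader and may differ):

```python
# Hedged sketch of the described pipeline: 448x448 patches at 10x, random
# 256x256 crops at training time, scaled to the [-1, 1] range LDM expects.
from PIL import Image
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(256),                   # random 256x256 crop of the 448x448 patch
    transforms.RandomHorizontalFlip(),            # typical extra augmentation (assumption)
    transforms.ToTensor(),                        # [0, 1]
    transforms.Lambda(lambda x: x * 2.0 - 1.0),   # rescale to [-1, 1]
])

patch = Image.open("patch_448.png").convert("RGB")  # placeholder path
crop = train_transform(patch)                       # tensor of shape [3, 256, 256]
```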

I am using your exact codebase and did take random crops during training.

On the other question: it seems the training set was used to compute the FID scores in the paper? Please clarify whether or not this is the case.

Can you please provide the links to the exact checkpoints, configuration, and anything else used during training in order to replicate your results?

All the FID scores reported in our paper are obtained using the same text reports used for training - this includes both our models and comparisons such as Stable Diffusion.
When we used the validation set text reports, FID was ~10.

Wouldn't this unfairly bias the results towards your model, though, considering the base Stable Diffusion checkpoints didn't have access to that training data?

Also, would it be possible to provide the exact checkpoints tested? It's still unclear based on the prior answers to issues on this repo.

The comparison is with Stable Diffusion fine-tuned on these patches, not the base version.
You can find checkpoints here. The best-performing model was fine-tuned from ImageNet weights and conditioned on text + tumor / TIL using the PLIP encoder.

@srikarym Did you do anything to verify the model wasn't overfitting on the training set, then? When reporting on the training set, there's no indication that the FID means anything: you can achieve a good (low) FID simply by reproducing memorized training data.

@srikarym Also, I'm asking for the starting model checkpoints, not just the final ones.

We used the cin256-v2 model for the U-Net, which is an ImageNet-pretrained model provided by the original LDM repo (see this).
For the autoencoder, we fine-tune the vq-f4 VAE on 10x BRCA image patches. The VAE weights can be extracted from our final diffusion checkpoint (#6)
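In LDM-style checkpoints the autoencoder weights are stored under the first_stage_model. key prefix, so they can be split out roughly like this (a sketch; file names are placeholders):

```python
# Hedged sketch: extracting the first-stage VAE weights from an LDM-style
# diffusion checkpoint, where they live under the "first_stage_model." prefix.
# File names are placeholders.
import torch

ckpt = torch.load("pathldm_final.ckpt", map_location="cpu")
sd = ckpt.get("state_dict", ckpt)

prefix = "first_stage_model."
vae_sd = {k[len(prefix):]: v for k, v in sd.items() if k.startswith(prefix)}

torch.save({"state_dict": vae_sd}, "vqf4_brca_vae.ckpt")
print(f"Extracted {len(vae_sd)} tensors for the autoencoder")
```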

Regarding overfitting: we perform data augmentation for a patch-level tumor classification task and observe that synthetic data improves performance.

Thanks for the clarification on the checkpoints! This helps a lot.

Back to overfitting: when testing the performance gained from the synthetic data, the experiment you describe doesn't show that unless you held the number of real and synthetic training samples constant. Otherwise, you cannot rule out that the performance improvement comes simply from adding more data, regardless of its quality. Also, for Table 5 in the paper, which dataset did you use to evaluate the performance? Was the accuracy computed on a single dataset, and if so, which one?

In Table 5, we used the same 10x patches from TCGA BRCA for training the tumor classifier. To generate synthetic samples, we randomly pick a text report, prepend low / high tumor, and assign the corresponding label to the synthetic patch.
Since the text reports are at the WSI level, the type of downstream / augmentation tasks we can perform is limited.

But during training, did you hold the number of samples constant regardless of whether they were real or synthetic?
It's unclear whether this is an equitable comparison.

We used an equal number of real and synthetic samples; when mixing both, the training set is doubled. This is a common setup for evaluating synthetic samples from diffusion models: https://arxiv.org/pdf/2304.08466
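A sketch of that mixing setup (directory names are placeholders; ImageFolder assumes class subfolders such as low_tumor/ and high_tumor/):

```python
# Hedged sketch of the augmentation setup described above: an equal number of
# real and synthetic labeled patches, so mixing both doubles the training set.
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.RandomCrop(256), transforms.ToTensor()])

real_ds = datasets.ImageFolder("real_patches/", transform=tfm)        # N real patches
synth_ds = datasets.ImageFolder("synthetic_patches/", transform=tfm)  # N synthetic patches

mixed_ds = ConcatDataset([real_ds, synth_ds])   # 2N samples in total
loader = DataLoader(mixed_ds, batch_size=64, shuffle=True, num_workers=8)
```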

Thanks so much for the information and all of the help!

As a brief follow-up, did you ever examine whether FID is a good indicator of histology image quality?
I'm working through that right now, as I'm unsure whether the Inception model can capture histology features adequately.

It's not the best metric for histology images, but it's good enough to compare different models. In our follow-up CVPR paper, we train diffusion models conditioned on SSL embeddings and use multiple metrics such as CLIP FID, embedding similarity, etc.
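For example, the clean-fid package exposes a CLIP-feature FID that can complement the standard Inception-based score (a sketch; directory paths are placeholders, and the model_name option is as documented by clean-fid):

```python
# Hedged sketch: CLIP-feature FID via the clean-fid package, as a complement
# to Inception-based FID. Directory paths are placeholders.
from cleanfid import fid

score = fid.compute_fid("real_crops/", "generated_samples/",
                        mode="clean", model_name="clip_vit_b_32")
print(f"CLIP-FID: {score:.2f}")
```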