mlpc-ucsd/TokenCompose

Datasets on reproducing FID results

Closed this issue · 7 comments

    Thank you for your contribution! 
    I am unable to reproduce the FID results. 
    Can you provide 10000 image-caption pairs sampled from COCO validation set (C) and 1000 image-caption pairs from Flickr30K entities validation set (F)? 
    Thanks!

Hi. Thank you for your interest. For the COCO FID, we followed the evaluation protocol described on the Hugging Face model card for Stable Diffusion v1.4: https://huggingface.co/CompVis/stable-diffusion-v1-4 . Namely, they use 10k random prompts from the COCO 2017 validation set.

Thank you for your reply!
Perhaps I am wrong, but as far as I can tell the COCO 2017 val set only has 5k prompts. Why is that?

There are 5k images in the COCO 2017 val set but around 25k captions. Prompts are defined at the caption level, not the image level.
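To make the caption-level sampling concrete, here is a minimal sketch. It assumes the standard COCO captions annotation file (`captions_val2017.json`, whose `annotations` list holds one record per caption); the path, the seed, and the helper names `load_captions` and `sample_prompts` are illustrative only and will not reproduce the exact indices in the attached file.

```python
import json
import random


def load_captions(annotation_path):
    """Return the caption records from a COCO captions annotation file."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Each record carries "image_id", "id", and "caption" keys.
    return data["annotations"]


def sample_prompts(annotations, k=10_000, seed=0):
    """Sample k distinct captions (not images) uniformly at random."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    picked = rng.sample(range(len(annotations)), k)
    return [(annotations[i]["image_id"], annotations[i]["caption"]) for i in picked]
```

For example, `sample_prompts(load_captions("annotations/captions_val2017.json"))` would return 10k `(image_id, caption)` pairs. Because sampling is over captions, one image can contribute more than one prompt.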

val2017_rand_index.json

Feel free to download the above indices file if you want to ensure a fair comparison, though the variance should be very small even if you don't use exactly these indices. We center crop the images to ensure that they all have a square shape and use this codebase to calculate the FID.
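As an illustration of the center-cropping step, here is a minimal sketch. The helper names `center_crop_box` and `center_crop` are hypothetical, and `center_crop` assumes an object with a Pillow-style `.size` attribute and `.crop(box)` method; any resizing done afterwards by the FID code is not covered here.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def center_crop(img):
    """Crop a Pillow-style image to its largest centered square."""
    width, height = img.size
    return img.crop(center_crop_box(width, height))
```

With Pillow installed, `center_crop(Image.open(path))` would give the square image; a 640x480 input is cropped to the box `(80, 0, 560, 480)`.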

The same pipeline also applies to the Flickr30K Entities val set.

Let us know if you have any other questions -- thanks!

Thank you for your reply, it has been of great help to me.
I have another question: how was the efficiency metric obtained, and in particular how was its standard deviation computed?
Looking forward to your reply!

You can generate a number of images (say 100) with the same set of prompts for all your baselines using the same GPU. Then you time it and take the avg/std for efficiency metrics. We strongly recommend that you manually do this for all your baselines (instead of taking our numbers) for this column because latency can differ even for the same GPU due to different system configs, hardware brand, software, etc.
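A minimal sketch of such a timing loop follows. The function name `time_generation`, the `warmup` count, and the optional `sync` hook are assumptions for illustration, not the script used for the reported numbers. On a CUDA GPU you would likely pass `torch.cuda.synchronize` as `sync`, since GPU work is queued asynchronously and the clock could otherwise stop early.

```python
import statistics
import time


def time_generation(generate_fn, prompts, warmup=1, sync=None):
    """Time generate_fn(prompt) for each prompt; return (mean_s, std_s)."""
    # Warm-up calls are excluded so one-off setup cost does not skew the result.
    for prompt in prompts[:warmup]:
        generate_fn(prompt)

    latencies = []
    for prompt in prompts:
        if sync is not None:
            sync()  # make sure earlier GPU work has finished before starting
        start = time.perf_counter()
        generate_fn(prompt)
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the clock
        latencies.append(time.perf_counter() - start)

    mean = statistics.mean(latencies)
    # The sample standard deviation needs at least two measurements.
    std = statistics.stdev(latencies) if len(latencies) > 1 else 0.0
    return mean, std
```

Run it with the same prompt list and the same GPU for every baseline, for example `time_generation(lambda p: pipe(p), prompts)` where `pipe` stands in for whichever generation call each baseline exposes.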

Thank you for your reply, it has been of great help to me.
Wishing you a happy life.

I assume that all your questions have been resolved as of now, so I'm closing this issue. If you encounter any other questions, feel free to re-open it or submit a new issue. Have a great rest of your day!