imlixinyang/HiSD

How can I reproduce the quantitative experiment results in the paper?

HyZhu39 opened this issue · 44 comments

First of all, congratulations on the results of the research,
and thank you for the concise and understandable code implementation.

However, I ran into some problems when trying to reproduce the quantitative results in the paper. Here is what I did:
Realism:

  1. Get all test images with the attribute "without bangs" (the first 3000 images of CelebA-HQ are used as test images, filtered by the labels in the CelebAMask-HQ-attribute-anno.txt file).
  2. Translate them to "with bangs" using my self-trained model (config file: celeba-hq.yaml), with the latent-guided method (5 randomly generated style codes) and the reference-guided method (5 randomly sampled test images with the attribute "with_bangs" as reference images).
  3. Calculate FID (using https://github.com/mseitzer/pytorch-fid) against all test images with the attribute "with_bangs", as described in the paper.
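
The filtering in step 1 could be sketched as follows. This is a minimal sketch, not the authors' preprocessing. It assumes the annotation file has the image count on its first line, the attribute names on its second line, and then one line per image with a file name followed by 1/-1 values, and that the test split is the files 0.jpg to 2999.jpg. The function name `select_test_images` is hypothetical.

```python
def select_test_images(anno_lines, wanted, test_size=3000):
    """Return test-split file names whose attributes match `wanted`.

    anno_lines: lines of the annotation file (assumed layout: line 0 is the
                image count, line 1 the attribute names, then one line per
                image: '<file> <v1> <v2> ...' with values 1 or -1).
    wanted:     dict mapping attribute name to required value,
                e.g. {"Bangs": -1} for "without bangs".
    test_size:  images with index below this are treated as the test split
                (an assumption based on "first 3000 images").
    """
    names = anno_lines[1].split()
    columns = {attr: names.index(attr) for attr in wanted}
    selected = []
    for line in anno_lines[2:]:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        filename, values = parts[0], parts[1:]
        index = int(filename.split(".")[0])
        if index >= test_size:
            continue  # not part of the test split
        if all(int(values[columns[a]]) == v for a, v in wanted.items()):
            selected.append(filename)
    return selected


# Tiny synthetic annotation file for illustration only.
demo = [
    "4",
    "Bangs Male Young",
    "0.jpg -1 1 1",
    "1.jpg 1 1 1",
    "2.jpg -1 -1 1",
    "3000.jpg -1 1 1",
]
print(select_test_images(demo, {"Bangs": -1}))
```

The same function covers the multi-attribute case by passing several entries, for example `{"Bangs": -1, "Male": 1, "Young": 1}`.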

Disentanglement:

  1. Get all test images with the attributes "young", "male", "without bangs".
  2. Translate them with the latent-guided method (5 randomly generated style codes) and the reference-guided method (5 randomly sampled test images with the attributes "with_bangs", "young", "male" as reference images).
  3. Calculate FID against all test images with the attributes "with_bangs", "young", "male", as described in the paper.

Then I got FID results:
L:25.05 R:25.21 G:0.16 in the "Realism" experiment
L:85.75 R:84.45 G:1.30 in the "Disentanglement" experiment
In the paper, the results are:
L:21.37 R:21.49 G:0.12 in the "Realism" experiment
L:71.85 R:71.48 G:0.37 in the "Disentanglement" experiment

The experiment has several sources of randomness, so some fluctuation in FID is normal, but these results are too far off.

I think there must be something wrong with my data processing, or the training method.
So, could you please explain the data used in the quantitative experiment and the method of data processing in detail? If possible, could you please release the model of the paper's experiment config?

There are indeed some differences between the cleaned code and the original code, but I would expect them to make the results better rather than worse.
Sorry about that. I will do my best to help you reproduce the quantitative results.
I will respond tomorrow, please wait.

Thanks for your attention and your quick reply. I look forward to hearing from you!

@HyZhu39 Hello, how did you compute the FID between the images generated with the 5 style codes and the real images?
The images generated with the 5 style codes should be put into 5 separate folders, and the reported value is the average of the FID between each folder and the real images. Each folder has the same number of images as the original source images.
For disentanglement in our experiments, the reference-guided style codes are randomly sampled from all images with bangs.
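
The averaging described above can be sketched as follows. This is a hedged illustration, not the authors' evaluation script: `fid_fn` is a hypothetical callable standing in for whatever FID implementation is used, and the gap G is taken to be the absolute difference between the latent-guided (L) and reference-guided (R) averages, which is what the numbers in this thread suggest.

```python
def average_fid(fid_fn, real_dir, fake_dirs):
    """Average the FID of each generated folder against the real folder.

    fid_fn is a hypothetical callable (real_dir, fake_dir) -> float;
    plug in the FID implementation you actually use.
    """
    scores = [fid_fn(real_dir, fake_dir) for fake_dir in fake_dirs]
    return sum(scores) / len(scores), scores


def gap(latent_avg, reference_avg):
    """Gap G between the latent-guided and reference-guided averages
    (assumed here to be the absolute difference)."""
    return abs(latent_avg - reference_avg)


if __name__ == "__main__":
    # Stub FID values for illustration only; real values come from fid_fn.
    stub = {"latent_0": 25.0, "latent_1": 26.0}
    avg, per_folder = average_fid(lambda real, fake: stub[fake], "real", list(stub))
    print(avg, per_folder)
```

Keeping one folder per style code means each FID is computed over a set with exactly one output per source image.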

Actually, I put them all in one folder and calculated the FID between the two folders, and for the disentanglement experiments I selected reference images only from the test images with bangs. Thanks for pointing that out. I'll try it as you said and report the results.
I think what you said is exactly the point. Thanks again.

You're welcome. Since the same identities appear several times in one folder, the FID (which uses the variance of the image features) would definitely become larger.
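
For reference, the standard definition of FID is given below. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the Inception features of the real and generated images. The "variance" mentioned above corresponds to the covariance term, which is why the way generated images are grouped into folders can change the score.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```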

Sorry to bother you again. I put the images generated with the 5 different style codes into separate folders, one per style code, and tested again, but the results got worse... that's weird, I think.
I ran two groups of experiments with the self-trained model from my first comment.

experiment 1:
realism:
(input images: all test images (first 3000 images) with the attribute "without_bangs", translated to "with_bangs";
reference images: 5 images randomly sampled from all images with the attribute "with_bangs";
FID calculated against: all test images with the attribute "with_bangs", resized to 128×128)
L: R: G:
0: 26.45 26.59
1: 26.47 26.64
2: 26.44 27.04
3: 26.84 28.99
4: 25.90 26.38
average: 26.42 27.13 0.71
(randomly chosen reference images: 5645.jpg, 6245.jpg, 13652.jpg, 14380.jpg, 27363.jpg)

disentanglement:
(input images: all test images with the attributes "without_bangs", "young", "male", translated to "with_bangs";
reference images: 5 images randomly sampled from all images with the attribute "with_bangs";
FID calculated against: all test images with the attributes "with_bangs", "young", "male", resized to 128×128)
L: R: G:
0: 88.79 87.49
1: 88.28 85.61
2: 87.23 92.51
3: 89.40 86.07
4: 88.30 88.11
average: 88.40 87.96 0.44
(randomly chosen reference images: 426.jpg, 19849.jpg, 22869.jpg, 26513.jpg, 28732.jpg)

experiment 2:
realism: same settings as experiment 1;
L: R: G:
0: 27.53 26.78
1: 32.38 26.40
2: 25.72 31.98
3: 28.18 27.48
4: 26.58 27.02
average: 28.08 27.93 0.17
(randomly chosen reference images: 5645.jpg, 6245.jpg, 13652.jpg, 14380.jpg, 27363.jpg)

disentanglement: same settings as experiment 1;
L: R: G:
0: 86.59 86.61
1: 89.13 90.18
2: 85.41 94.21
3: 89.02 87.94
4: 86.36 91.12
average: 87.30 90.01 2.71
(randomly chosen reference images: 923.jpg, 1232.jpg, 12886.jpg, 24491.jpg, 26797.jpg)

I resized and saved the images used for the FID calculation as "easy_use.py" does:
from PIL import Image
import torchvision.utils as vutils
from torchvision import transforms
transform = transforms.Compose([transforms.Resize(image_size), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
x = transform(Image.open('image_save_path here').convert('RGB')).unsqueeze(0)
vutils.save_image(((x + 1) / 2), save_path, padding=0)
By the way, I trained the model on a single GTX 1080Ti 11GB GPU for 200000 iterations with the config file celeba-hq.yaml.

Actually, you need to randomly sample the reference images for each source image. If you sample only one reference image to translate all the source images into 'with_bangs', the bangs in the translated folder will be the same, right?
So the process should be like:

For i in range(5):
  For each source image x:
    randomly sample a reference image y
    translate x using y as reference
  calculate FID
calculate Average FID

So the problem may be that you sampled the reference image y before the loop over the source images.
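
The loop above could be written in Python roughly as follows. This is only a sketch of the sampling order, under stated assumptions: `translate` and `compute_fid` are hypothetical callables standing in for the model's reference-guided translation and for the FID computation against the real images, and are not part of the released code.

```python
import random


def evaluate_reference_guided(sources, references, translate, compute_fid,
                              runs=5, seed=0):
    """Average FID over `runs` passes, sampling a fresh reference per source.

    translate(x, y) and compute_fid(outputs) are hypothetical callables:
    the first applies the reference-guided translation of source x with
    reference y, the second scores a full set of outputs against the
    real images.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        outputs = []
        for x in sources:
            # Sample inside the inner loop so every source image gets
            # its own randomly chosen reference.
            y = rng.choice(references)
            outputs.append(translate(x, y))
        scores.append(compute_fid(outputs))
    return sum(scores) / len(scores), scores
```

Sampling `y` outside the inner loop would give every output in a folder the same reference style, which is the mistake described above.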

Thank you very much for your patience and help, I will try again as soon as possible and give you feedback.

Sorry for the mistakes I made and for misunderstanding your experiment settings. I think I understand them now. I now randomly sample a reference image for each source image, following the logic you proposed.
I redid the experiments as you said and got relatively more stable results than before:

realism:
L: R: G:
group 1:
0: 25.70 25.66
1: 25.60 25.53
2: 25.48 25.56
3: 25.69 25.58
4: 25.53 25.74
avg: 25.60 25.61 0.01
group 2:
0: 25.61 25.60
1: 25.53 25.61
2: 25.60 25.55
3: 25.65 25.66
4: 25.60 25.63
avg: 25.60 25.61 0.01

disentanglement:
L: R: G:
group 1:
0: 85.71 84.91
1: 86.57 84.96
2: 86.14 85.51
3: 85.50 85.61
4: 86.56 85.51
avg: 86.10 85.30 0.80
group 2:
0: 85.89 85.87
1: 86.41 85.13
2: 86.11 85.91
3: 85.88 84.57
4: 87.12 85.80
avg: 86.28 85.46 0.82

However, the results are still much worse than the paper's. There might be something wrong with my training stage, so maybe I should retrain my model and try again.
That said, I used exactly the same hardware and exactly the same training settings, yet the results are worse. It is also possible that, because the training code has changed, the previous training settings no longer let the current model fully converge. (In fact, according to the loss curves during training, the adversarial losses of both the generator and the discriminator are quite unstable, although this might also be due to the nature of GAN training itself.)

I don't know much about image translation and am just a beginner in this area, so I hope my ignorance doesn't bore you.

Asking questions is always encouraged in research.
Can you share the qualitative results of your self-trained checkpoint here?

Many thanks for your help. I have packed some qualitative results and the images from my quantitative experiment (if needed) at the following Baidu Yun link. Thank you for your willingness to help.
https://pan.baidu.com/s/1r1deZsdbJ4RgFhTXRUKjpQ
Extraction code: HISD
and my checkpoint file(if needed):
https://pan.baidu.com/s/1C6_Pm-gEpwGQFRDaMBDNNg
Extraction code: HISD

The qualitative results seem promising.
I calculated FID using StarGANv2's script.
I checked the difference between StarGANv2's script and pytorch-fid and found that the former has a preprocessing step, which is

def get_eval_loader(root, img_size=256, batch_size=32,
                    imagenet_normalize=True, shuffle=True,
                    num_workers=4, drop_last=False):
    print('Preparing DataLoader for the evaluation phase...')
    if imagenet_normalize:
        height, width = 299, 299
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
    else:
        height, width = img_size, img_size
        mean = [0.5, 0.5, 0.5]
        std = [0.5, 0.5, 0.5]

    transform = transforms.Compose([
        transforms.Resize([img_size, img_size]),
        transforms.Resize([height, width]),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std)
    ])

    dataset = DefaultDataset(root, transform=transform)
    return data.DataLoader(dataset=dataset,
                           batch_size=batch_size,
                           shuffle=shuffle,
                           num_workers=num_workers,
                           pin_memory=True,
                           drop_last=drop_last)

So there may be a preprocessing step (a simple normalization) that you need to add to your code. Let me know the results. I think we are close to making it work.

Sorry for the delayed reply. The missing preprocessing was indeed what caused the much worse FID results. With StarGANv2's script, the FID results for the outputs I shared most recently improved to:
group 1:
                  L      R      G
realism:          21.27  21.34  0.07
disentanglement:  72.55  72.51  0.04
group 2:
                  L      R      G
realism:          21.28  21.24  0.04
disentanglement:  72.31  72.33  0.02
compared to the paper's results:
                  L      R      G
realism:          21.37  21.49  0.12
disentanglement:  71.85  71.48  0.37
The disentanglement results are still slightly worse. I am not sure about the typical range of FID fluctuation, so maybe this is acceptable?

I do think this is acceptable.
In the paper, we also discuss the tension between Realism and Disentanglement (see Sec 4.3 on the model without tag-irrelevant conditions). So achieving better results in both Realism and Disentanglement also surprised me at first.
To summarize, the differences between the released code and the original code are:

  • the original one does not use ALI in the adversarial loss (you can turn it off by setting all s[:]=0 in the discriminator's forward pass).
  • the original one uses tag-irrelevant conditions that include other tags (for example, the labels of hair color and bangs are also used for the glasses tag).

I've changed the README to clarify the corrected FID script I used for the quantitative results. Thank you for your enthusiastic reproduction!

Thank you again for your help. It is thanks to your selfless help that I could successfully reproduce your experimental results.
We communicate in English here for the convenience of other readers.
Here I would like to thank you again personally:
Thank you for your patient help throughout. I sincerely wish you success in your future research~

You too~

Could you share the images from your quantitative experiment again? The Baidu Yun link is no longer valid. I am also trying to reproduce the quantitative results in the paper by following this issue, but I cannot get results close to the paper's.

@oldrive What are the detailed settings of your reproduction?

config: celeba-hq.yaml
checkpoint: checkpoint_128_celeba-hq.pt
compute_fid_script: fid.py from StarGANv2, used to compute the FID between fake_images and real_images
realism FID of L:
fake_images = [latent_images_0, latent_images_1, latent_images_2, latent_images_3, latent_images_4]
latent_images_i is generated from test_bangs_without (according to Test_Bangs_without.txt), using a random latent code as guidance.
real_images = [test_bangs_with images according to Test_Bangs_with.txt]
realism_latent_fid_average = ( fid(fake_images[0], real_images) + ... + fid(fake_images[4], real_images) ) / 5

realism FID of R:
fake_images = [reference_images_0, reference_images_1, reference_images_2, reference_images_3, reference_images_4]
reference_images_i is generated with random reference guidance, with references taken from all_bangs_with (according to Bangs_with.txt and Test_Bangs_with.txt).
real_images = [test_bangs_with images according to Test_Bangs_with.txt]
realism_reference_fid_average = ( fid(fake_images[0], real_images) + ... + fid(fake_images[4], real_images) ) / 5

The realism FID results are as follows:
Group 1:
realism_fid_latent_0: 31.692982996524883
realism_fid_latent_1: 31.671476972145367
realism_fid_latent_2: 31.620433186098698
realism_fid_latent_3: 31.629911284997206
realism_fid_latent_4: 31.73387679777522
realism_fid_reference_0: 32.591734278849
realism_fid_reference_1: 32.215290934387426
realism_fid_reference_2: 32.18949934088806
realism_fid_reference_3: 32.287988762946526
realism_fid_reference_4: 32.304219580808336
realism_fid_latent_average: 31.669736247508276
realism_fid_reference_average: 32.31774657957587

Group 2:
realism_fid_latent_0: 31.642293517652654
realism_fid_latent_1: 31.623934807071
realism_fid_latent_2: 31.68461378392377
realism_fid_latent_3: 31.631847657251797
realism_fid_latent_4: 31.67548435280436
realism_fid_reference_0: 32.29246639585722
realism_fid_reference_1: 32.288538090496914
realism_fid_reference_2: 32.11632434611198
realism_fid_reference_3: 32.15312062309697
realism_fid_reference_4: 32.23484964483734
realism_fid_latent_average: 31.651634823740714
realism_fid_reference_average: 32.21705982008008

What command did you use to calculate the FID?

Just like this:
latent_fid_value = calculate_fid_given_paths([real_path, fake_latent_path[i]], args.img_size, args.batch_size)

The "args.img_size" is set to be 128, right?

right.
parser.add_argument('--img_size', type=int, default=128, help='image resolution')

What about the qualitative results?

The results are in my reply above.

I mean the visual results.

Oh, I misunderstood what you meant.

Some results of realism_latent_0: [three attached images: 0, 1, 2]
Some results of realism_reference_0: [three attached images: 0, 1, 2]

Every image in a folder has a different style of bangs.

The visual results seem normal. Please change the image size used in FID to 256 or 224. I don't quite remember the exact setting, but it matters because the Inception network is trained at a specific resolution.

I'll have a try as you said and tell you the results. Thanks for your reply!

Sorry to bother you again. I computed the realism FID in two groups. Group 1 uses --img_size=256 and computes the FID between the fake images (256×256, generated with the 256.config and 256.checkpoint) and the real images. Group 2 uses the same argument (--img_size=256) and computes the FID between the fake images (128×128, generated with the 128.config and 128.checkpoint) and the real images. But the results got worse... that is so weird.

realism_fid(256*256 fake_images and real_images, fid(fake_images, real_images, arg.img_size = 256)):
realism_fid_latent_0: 37.70455934722888
realism_fid_reference_0: 38.05122125169506
realism_fid_latent_1: 37.59272856627348
realism_fid_reference_1: 37.81830888013152
realism_fid_latent_2: 37.698022304952914
realism_fid_reference_2: 38.03778528813959
realism_fid_latent_3: 37.610822585752246
realism_fid_reference_3: 38.0628089612687
realism_fid_latent_4: 37.688711544348806
realism_fid_reference_4: 37.91353803968795
realism_fid_latent_average: 37.65896886971126
realism_fid_reference_average: 37.976732484184566

realism_fid(128*128 fake_images and real_images, fid(fake_images, real_images, arg.img_size = 256)):
realism_fid_latent_0: 69.20908546448136
realism_fid_reference_0: 69.23383364990423
realism_fid_latent_1: 69.11336443484716
realism_fid_reference_1: 69.34028775602908
realism_fid_latent_2: 69.18649394941102
realism_fid_reference_2: 69.52593890927548
realism_fid_latent_3: 69.09191563199727
realism_fid_reference_3: 69.40835510741587
realism_fid_latent_4: 69.0797907168618
realism_fid_reference_4: 69.28953132218695
realism_fid_latent_average: 69.13613003951971
realism_fid_reference_average: 69.35958934896232

Could you share some real images in test_bangs_with.txt?

Here are the first five images in test_bangs_with:
15
17
31
43
44

[Image: Screenshot from 2021-11-16 13-59-09]

@oldrive I don't know if this is the reason, but in my experiments the real images are also resized to the target resolution first and saved in a folder, just as @HyZhu39 did:

I resized and saved the images used for the FID calculation as "easy_use.py" does:
from PIL import Image
import torchvision.utils as vutils
from torchvision import transforms
transform = transforms.Compose([transforms.Resize(image_size), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
x = transform(Image.open('image_save_path here').convert('RGB')).unsqueeze(0)
vutils.save_image(((x + 1) / 2), save_path, padding=0)
Could you give it a try?

I did not resize the real images or save them to a folder before computing the FID. I'll give it a try.

Oh, that was the reason. I now get realism and disentanglement FID results closer to the paper's.
realism_fid:
realism_fid_latent_0: 20.912922731584082
realism_fid_reference_0: 21.046019767649355
realism_fid_latent_1: 20.76848449633095
realism_fid_reference_1: 21.04662247713575
realism_fid_latent_2: 20.800978320503397
realism_fid_reference_2: 21.0600877899802
realism_fid_latent_3: 20.775910991635065
realism_fid_reference_3: 20.92837823926883
realism_fid_latent_4: 20.68396588649034
realism_fid_reference_4: 20.94170026977707
realism_fid_latent_average: 20.788452485308767
realism_fid_reference_average: 21.004561708762242

disentangle_fid:
disentangle_fid_latent_0: 71.39510730377387
disentangle_fid_reference_0: 70.64971902519095
disentangle_fid_latent_1: 71.06008491519601
disentangle_fid_reference_1: 70.88973207966575
disentangle_fid_latent_2: 71.40558227571222
disentangle_fid_reference_2: 71.33517553604398
disentangle_fid_latent_3: 71.2109615470645
disentangle_fid_reference_3: 71.0546303462186
disentangle_fid_latent_4: 71.48734756970637
disentangle_fid_reference_4: 71.08293051285575
disentangle_fid_latent_average: 71.31181672229059
disentangle_fid_reference_average: 71.00243749999501

Thank you again for your patient help and quick replies. With your help I was able to reproduce the quantitative results in the paper.
My heartfelt thanks to the author for the enthusiastic help, and I wish the author smooth sailing in future research~

Ideally, these two resizing steps should give the same result. I think the reason may be the transforms.Resize module. As this link says, when the input is a PIL image, the resize function uses antialiasing by default.
You're welcome, and thank you very much as well for your interest in this work. All the best!

Sorry to disturb you. I am reproducing the experimental results of this paper. There are 568 real images with bangs and 2432 images without bangs. After translation, I get 2432 images with bangs. Should I calculate FID directly between these two sets, or select 568 of the 2432 images for the calculation? Looking forward to your reply. Thank you!

Yes, you can calculate FID directly between the two sets. The FID evaluation computes the mean and covariance of the feature distribution separately for each folder, so you don't need to worry about the different numbers of images. @zhushuqi2333
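
A toy illustration of why the set sizes need not match is given below. It is a sketch under simplifying assumptions, not the Inception-based FID: each "image" is reduced to a single scalar feature, the population variance is used, and the function name `frechet_distance_1d` is hypothetical. In one dimension the Fréchet distance between two Gaussians reduces to the squared difference of the means plus the squared difference of the standard deviations.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_1d(a, b):
    """Fréchet distance between 1-D Gaussians fitted to samples a and b.

    Only the fitted mean and standard deviation of each set enter the
    result, so the two sets may contain different numbers of samples.
    """
    mu_a, mu_b = fmean(a), fmean(b)
    sd_a, sd_b = math.sqrt(pvariance(a)), math.sqrt(pvariance(b))
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


if __name__ == "__main__":
    small = [0.0, 1.0, 2.0, 3.0]
    large = small * 4                     # four times as many samples, same distribution
    shifted = [v + 1.0 for v in small]    # same spread, mean moved by 1
    print(frechet_distance_1d(small, large))
    print(frechet_distance_1d(small, shifted))
```

In practice the statistics are estimated from finite samples, so very small sets still give noisier estimates even though equal counts are not required.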

Thank you for your reply~ I have carefully read all the answers in this issue and run the relevant experiments. All my experiments use 128×128 images, because I trained the model with celeba-hq.yaml, whose resolution is 128×128.
The experimental configuration is as follows:

The one difference from the discussion above is that I calculate FID with --img_size=128. All real images are 128×128, and all translated images are also 128×128.
My experimental results are as follows:

     realism  disentanglement
L:   22.63    72.46
R:   21.17    71.63
G:   1.46     0.83

compared to the paper's results:
     Realism  Disentanglement
L:   21.37    71.85
R:   21.49    71.48
G:   0.12     0.37

L is a little larger than the paper's, and G is too large. Can you give some suggestions? Looking forward to your reply~

I think the difference between these two results is acceptable if you only calculate once. You can try:

  1. calculate the average FID over 5 random runs (different seeds for L and G).
  2. use a different checkpoint. The latest checkpoint is not always the best.

@zhushuqi2333

Thank you for your reply~ The results above are averages. I will try more seeds and different checkpoints.