sbarratt/inception-score-pytorch

Different inception scores as compared to BigGAN paper.

chirag126 opened this issue · 7 comments

Hello,
I was going through your paper and found that you report an Inception Score of 63.702±7.869 for the ImageNet validation set (at 299x299 image size), but the BigGAN paper reports 166.5 (at 128x128 image size). Can you comment on the discrepancy here?

k-eak commented

I think you need to shuffle the data, since the score is calculated over 50 images at a time. If all 50 images belong to the same class, the score will be much lower.
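
To illustrate the point, here is a minimal standalone sketch (mine, not from this repo) with synthetic softmax outputs: a split whose confident predictions are spread over many classes scores near the number of classes, while an equally confident single-class split scores near 1.

import numpy as np

def split_score(preds):
    # Inception Score of one split: exp(mean_x KL(p(y|x) || p(y)))
    py = preds.mean(axis=0, keepdims=True)
    kl = (preds * (np.log(preds) - np.log(py))).sum(axis=1).mean()
    return np.exp(kl)

rng = np.random.default_rng(0)
n_imgs, n_classes = 50, 10

# Confident predictions spread across all classes (shuffled data)
mixed = np.full((n_imgs, n_classes), 0.01 / (n_classes - 1))
mixed[np.arange(n_imgs), rng.integers(0, n_classes, n_imgs)] = 0.99

# Equally confident predictions, but all on class 0 (class-sorted data)
single = np.full((n_imgs, n_classes), 0.01 / (n_classes - 1))
single[:, 0] = 0.99

print(split_score(mixed))   # close to n_classes
print(split_score(single))  # exactly 1.0: p(y) equals every p(y|x)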

Do you mean that I need to add shuffle=True to the dataloader in the original code?
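
For reference, that would be a one-line change to the DataLoader construction (a sketch, assuming the repo's default setup):

dataloader = torch.utils.data.DataLoader(imgs, batch_size=batch_size, shuffle=True)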

YLJALDC commented

I can only get an IS of 9.6 for the original CIFAR dataset, which should be 11.3. Here is the code to reproduce. Any suggestions?

[screenshot of the reproduction code]

HolmesShuan commented

@YLJALDC The transforms.Normalize matters in this case. I use the following code to achieve an IS of 10.54±0.15 on the original CIFAR dataset.

import torch
from torch import nn
from torch.nn import functional as F
import torch.utils.data

from torchvision.models import inception_v3

import numpy as np

def inception_score(imgs, cuda=True, batch_size=32, resize=False, splits=1):
    """Computes the Inception Score of the images in imgs

    imgs -- Torch dataset of 3xHxW image tensors, normalized as Inception v3 expects
    cuda -- whether or not to run on GPU
    batch_size -- batch size for feeding into Inception v3
    resize -- whether to upsample inputs to 299x299
    splits -- number of splits
    """
    N = len(imgs)

    assert batch_size > 0
    assert N > batch_size

    device = torch.device('cuda' if cuda else 'cpu')

    # Set up dataloader; shuffling keeps each split from containing a single class
    dataloader = torch.utils.data.DataLoader(imgs, shuffle=True, batch_size=batch_size)
    print("INFO : Dataset ready ...")

    # Load the pretrained Inception v3 model
    inception_model = inception_v3(pretrained=True, transform_input=False).to(device)
    inception_model.eval()
    print("INFO : Inception model ready ...")

    up = nn.Upsample(size=(299, 299), mode='bilinear', align_corners=False)

    def get_pred(x):
        if resize:
            x = up(x)
        with torch.no_grad():  # no gradients needed for scoring
            x = inception_model(x)
        return F.softmax(x, dim=1).cpu().numpy()

    # Get class-probability predictions for every image
    preds = np.zeros((N, 1000))
    print("INFO : Extra memory allocated ...")

    for i, batch in enumerate(dataloader):
        batch = batch.to(device)
        batch_size_i = batch.size(0)
        preds[i * batch_size:i * batch_size + batch_size_i] = get_pred(batch)
        if i % 1000 == 0:
            print('==> Processing batch ' + str(i))

    # IS = exp(E_x[KL(p(y|x) || p(y))]), estimated separately on each split
    scores = []

    for k in range(splits):
        part = preds[k * (N // splits): (k + 1) * (N // splits), :]
        py = np.expand_dims(np.mean(part, 0), 0)  # marginal p(y) over the split
        kl = part * (np.log(part) - np.log(py))   # pointwise KL terms
        kl = np.mean(np.sum(kl, 1))               # mean KL over images in the split
        scores.append(np.exp(kl))
    return np.mean(scores), np.std(scores)

if __name__ == '__main__':
    class IgnoreLabelDataset(torch.utils.data.Dataset):
        def __init__(self, orig):
            self.orig = orig

        def __getitem__(self, index):
            return self.orig[index][0]

        def __len__(self):
            return len(self.orig)

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    cifar = dset.CIFAR10(root='data/', download=True,
                         transform=transforms.Compose([
                             transforms.Resize(32),
                             transforms.ToTensor(),
                             # transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
                             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])
    )

    print("INFO : Calculating Inception Score ...")
    print(inception_score(IgnoreLabelDataset(cifar), cuda=True, batch_size=8, resize=True, splits=10))
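
As a side note on why the transforms.Normalize choice matters, here is a small standalone sketch (mine, not from the thread) of how the two options map the [0, 1] output of ToTensor: (0.5, 0.5, 0.5)/(0.5, 0.5, 0.5) rescales pixels to [-1, 1], while the per-channel CIFAR-10 statistics standardize each channel to roughly zero mean and unit variance.

import torch
import torchvision.transforms as transforms

x = torch.rand(3, 32, 32)  # stand-in for a ToTensor() output in [0, 1]

to_pm1 = transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
whiten = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))

print(to_pm1(x.clone()).min().item(), to_pm1(x.clone()).max().item())   # roughly -1 and 1
print(whiten(x.clone()).min().item(), whiten(x.clone()).max().item())   # roughly -2.4 and 2.7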

vibss2397 commented

@HolmesShuan is 10.54 an accurate enough score? I read that the TensorFlow implementation gives something around 11; is a difference of 0.5 small enough?

HolmesShuan commented

@vibss2397 I still recommend the TensorFlow implementation. On ImageNet, the IS from this implementation seems even more inconsistent with the reported results. I ran the CIFAR experiments multiple times and noticed that the score is always about 0.5 smaller than the baseline.