openai/improved-gan

Inception Score calculation

Opened this issue · 5 comments

The Inception Score calculation has 3 mistakes.

It uses an outdated Inception network that in fact outputs a 1008-vector of classes (see the following GitHub issue):

It turns out that the 1008-size softmax output is an artifact of dimension back-compatibility with an older, Google-internal system. Newer versions of the Inception model have 1001 output classes, where one is an "other" class used in training. You shouldn't need to pay any attention to the extra 8 outputs.

Fix: See link for the new Inception model.

It calculates the KL divergence directly using logs, which leads to numerical instabilities (it can output NaN instead of inf). Instead, scipy.stats.entropy should be used.

kl = part * (np.log(part) - np.log(np.expand_dims(np.mean(part, 0), 0)))
kl = np.mean(np.sum(kl, 1))

Fix: Replace the above with something along the lines of the following:

from scipy.stats import entropy

py = np.mean(part, axis=0)  # marginal distribution p(y)
kl = np.mean([entropy(part[i, :], py) for i in range(part.shape[0])])
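To see the instability concretely, here is a small sketch (the 2-class `part` array is a made-up example): the naive log arithmetic yields NaN when a softmax output contains an exact zero, while scipy.stats.entropy defines 0·log(0) = 0 and stays finite.

```python
import numpy as np
from scipy.stats import entropy

# Made-up batch of softmax outputs; the first row assigns an exact
# probability of 0 to one class (possible after float rounding).
part = np.array([[0.0, 1.0],
                 [0.5, 0.5]])
py = np.mean(part, axis=0)  # marginal distribution p(y)

# Naive version from the issue: 0 * log(0) evaluates to nan.
with np.errstate(divide='ignore', invalid='ignore'):
    kl = part * (np.log(part) - np.log(np.expand_dims(py, 0)))
naive = np.mean(np.sum(kl, 1))  # nan

# scipy.stats.entropy treats the zero term correctly and stays finite.
stable = np.mean([entropy(part[i, :], py) for i in range(part.shape[0])])
```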

It calculates the mean of the exponential of the split rather than the exponential of the mean:

Here is the code in inception_score.py which does this:

    scores.append(np.exp(kl))
return np.mean(scores), np.std(scores)

This is clearly problematic, as can easily be seen in a very simple case with an x ~ Bernoulli(0.5) random variable: E[e^x] = 0.5(e^0 + e^1) != e^(0.5·0 + 0.5·1) = e^E[x]. This can further be seen with an example with a uniform random variable, where the split mean over-estimates the exponential.

import numpy as np
data = np.random.uniform(low=0., high=15., size=1000)
split_data = np.split(data, 10)
np.mean([np.exp(np.mean(x)) for x in split_data]) # e.g. 1608.25 (varies per run)
np.exp(np.mean(data)) # e.g. 1477.25 (varies per run)

Fix: Do not calculate the mean of the exponential of the split, and instead calculate the exponential of the mean of the KL-divergence over all 50,000 inputs.
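A minimal sketch of this no-split version (the function name is made up; `preds` is assumed to be an (N, num_classes) array of softmax outputs over all inputs):

```python
import numpy as np
from scipy.stats import entropy

def inception_score_no_split(preds):
    """Exponential of the mean KL(p(y|x) || p(y)) over ALL inputs."""
    py = np.mean(preds, axis=0)  # marginal p(y)
    kl = np.mean([entropy(preds[i], py) for i in range(preds.shape[0])])
    return np.exp(kl)

# Sanity checks on made-up predictions:
# identical rows give the minimum score of 1 ...
uniform_preds = np.full((10, 5), 0.2)
# ... and confident, perfectly diverse rows give num_classes.
one_hot_preds = np.eye(4)
```

With `uniform_preds` every KL term is 0, so the score is 1; with `one_hot_preds` the score equals 4, the number of classes, which is its maximum.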

The first point is an important issue. For the third point, note that they do NOT intend to calculate the inception score (IS) for 50,000 inputs. Rather, they split 50,000 samples into 10 splits, each with 5,000 samples. They then calculate IS for each split and return the average IS over splits. So the code is correct.

So, which version is correct?

Hi, I have rewritten the code for calculating Inception Score, taking the first problem into consideration: https://github.com/tsc2017/inception-score

As to the second problem: since the softmax function hardly ever outputs an exact 0 for a category, the conditional and marginal distributions of y are supported on all 1000 classes, so it is unlikely to hit a 0·log(0), log(∞), or divide-by-zero error. I do not observe any numerical instability with either the old implementation or my new one.

Lastly, since the inception score is approximated by a statistic of a sample, just make sure the sample size is big enough. The common practice of 50,000 images in 10 splits seems acceptable. Taking the CIFAR-10 training set images as an example, the inception score is around 11.34 with 1 split and 11.31±0.08 with 10 splits.

Can anyone tell me where can I find the material about the inception score?

@lipanpeng https://arxiv.org/abs/1801.01973

@xunhuang1995 the third point is valid, because splits of 5k samples might be too small to adequately represent 1k classes. And as they show in the paper, IS changes depending on the size of the split.
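The split-size dependence can be seen without any network at all: for a fixed set of per-image KL values, refining the data into more (hence smaller) splits can only raise the mean-of-exponentials, by Jensen's inequality. A quick sketch with made-up KL values (the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
kl_values = rng.uniform(0.0, 3.0, size=50000)  # stand-in per-image KL terms

def split_score(values, n_splits):
    # Mean over splits of exp(mean KL within each split).
    return np.mean([np.exp(np.mean(s)) for s in np.split(values, n_splits)])

# n_splits = 1 recovers exp of the global mean; more splits only inflate it.
scores = [split_score(kl_values, n) for n in (1, 10, 100, 1000)]
```

Since each coarser split mean is the average of finer split means, applying exp before averaging at the finer level can only increase the result, so `scores` is non-decreasing in the number of splits.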