How to combine wgan and spectral norm?

Question

zhangqianhui opened this issue 6 years ago · 58 comments

Spectral normalization rescales each weight matrix W so that its spectral norm is 1, which bounds the Lipschitz constant of the discriminator by 1. That made us wonder whether the Wasserstein GAN loss can be combined with spectral normalization.

We ran some experiments using the WGAN loss with spectral normalization (SN) and without the gradient penalty. However, training was very unstable and it was hard to generate high-quality samples.

How can WGAN be combined with SN without the gradient penalty?
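
For reference, the Lipschitz bound that the question relies on can be written out. This is a sketch of the standard argument, assuming a discriminator $f$ built from linear layers $W^{1}, \dots, W^{L}$ with 1-Lipschitz activations (such as ReLU) in between; the layer count $L$ and the symbols are notation chosen here, not taken from the thread.

```latex
\|f\|_{\mathrm{Lip}} \;\le\; \prod_{l=1}^{L} \sigma\!\left(W^{l}\right),
\qquad
\bar{W}^{l}_{\mathrm{SN}} = \frac{W^{l}}{\sigma\!\left(W^{l}\right)}
\;\Longrightarrow\;
\sigma\!\left(\bar{W}^{l}_{\mathrm{SN}}\right) = 1
\;\Longrightarrow\;
\|f\|_{\mathrm{Lip}} \le 1 .
```

This is only an upper bound on the Lipschitz constant; it does not say that the normalized network attains a gradient norm of 1 anywhere.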

zhangqianhui commented 6 years ago

@takerum

Answer 1 · 2018-05-29T07:32:41.000Z

Hi,

We tried combining the WGAN loss with spectral normalization, but it did not work.
We are not sure why, and we would be happy to hear any ideas you have about the problem!

Answer 2 · 2018-05-29T09:13:03.000Z

Recently I combined the WGAN loss with spectral norm and got better results than before.
The changes in our experiments:

(1) In the D network, use spectral norm but remove the fully connected layers.
(2) Use RMSprop instead of Adam.
(3) Add a regularization term to the D loss (proposed in PG-GAN) to keep the output values from drifting too far away from zero:

0.0001 * tf.reduce_mean(tf.square(self.D_pro_logits))

You can try these; they are for reference only.
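
A minimal sketch of how change (3) could fit into a WGAN critic loss, written in plain Python so that it runs without TensorFlow. The function name `critic_loss`, the argument names, and the sign convention (the critic minimizes fake minus real) are assumptions for illustration. Only the 0.0001 weight and the mean-of-squares term come from the comment above; applying the term to the real-sample outputs follows the PG-GAN convention and is not stated in this thread.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores, drift_weight=1e-4):
    """WGAN critic loss plus a drift penalty on the real-sample scores.

    Assumption: the critic minimizes E[D(fake)] - E[D(real)].
    """
    wasserstein = mean(fake_scores) - mean(real_scores)
    # Drift term: keeps the critic outputs from drifting far from zero.
    drift = drift_weight * mean([s * s for s in real_scores])
    return wasserstein + drift


real = [2.0, 4.0]   # toy critic outputs on real samples
fake = [-1.0, 1.0]  # toy critic outputs on generated samples
loss = critic_loss(real, fake)
```

With such a small weight the penalty barely changes the loss value; its role is to pin down the otherwise free offset of the critic output.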

Answer 3 · 2018-05-29T09:22:38.000Z

If necessary, I will make this code public.

Answer 4 · 2018-05-29T09:34:49.000Z

Thanks so much! There may be other people with the same issue, so I would appreciate it if you made your implementation public.

Answer 5 · 2018-05-30T01:36:58.000Z

@takerum I have some questions:

In Table 2 of the SN-GAN paper, how did you calculate the FID score of real data (7.8)?
In Table 2, how many generated samples were used to compute the FID and Inception scores: 5,000 or 50,000?

Answer 6 · 2018-06-01T02:02:11.000Z

@takerum Your paper used 5,000 samples to compute the FID and Inception score, but the improved-gan paper uses 50,000 samples to compute Inception scores. Why the difference?

Answer 7 · 2018-06-01T05:23:27.000Z

In Table 2 of the SN-GAN paper, how did you calculate the FID score of real data (7.8)?

We sampled 10,000 images from the test set and 5,000 images from the training set and calculated the FID between the two sets of images.

In Table 2, how many generated samples were used to compute the FID and Inception scores: 5,000 or 50,000?

Your paper used 5,000 samples to compute the FID and Inception score, but the improved-gan paper uses 50,000 samples to compute Inception scores.

Both the original paper and our paper use 50,000 samples in total for calculating the "mean" and "std" of the Inception score.
Both papers calculate the Inception score on 5,000 samples and repeat this 10 times, each time on an independently generated set of images, to estimate the mean and variance of the score.
For FID, we calculated it with 5,000 samples and reported that value, because we found that the variation of FID across independent sets is very small compared to the value of FID itself.
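
The split-and-repeat bookkeeping described above can be sketched as follows. This only illustrates how a mean and standard deviation come out of 10 groups of 5,000 samples. `score_fn` is a stand-in for a real Inception-score computation, and the function name `mean_std_over_splits`, the toy data, and the use of the population standard deviation are assumptions, not code from either paper.

```python
import statistics


def mean_std_over_splits(samples, score_fn, n_splits=10):
    """Score each of n_splits equal chunks and return (mean, std) of the scores."""
    chunk = len(samples) // n_splits
    scores = [
        score_fn(samples[i * chunk:(i + 1) * chunk])
        for i in range(n_splits)
    ]
    # Population standard deviation over the per-chunk scores.
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy stand-in: 50,000 "samples" and a scorer that just averages a chunk.
samples = [i % 10 for i in range(50_000)]
mean_score, std_score = mean_std_over_splits(samples, lambda c: sum(c) / len(c))
```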

Answer 8 · 2018-06-01T06:48:32.000Z

OK, thanks.

Answer 9 · 2018-06-01T06:50:44.000Z

I got an FID of 7.78 for real data.

Answer 10 · 2018-06-04T15:43:51.000Z

For FID, we calculate it with 5,000 samples and report the value, because we found that the variation of FID within independent sets is very small compared to the value of FID.

Friendly PSA: don't do this. The FID estimator has very low variance across runs, but very strong bias, especially due to the sample size; the numbers can only be compared with a consistent sample size, and most people use 50,000. Check out section 4 (starting page 7) and appendix D (starting page 30) of our paper Demystifying MMD GANs for more about this.
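
The sample-size bias can be illustrated with a one-dimensional analogue. For two univariate Gaussians the squared Fréchet distance reduces to (m1 - m2)^2 + (s1 - s2)^2. The sketch below estimates it between two independent draws from the same distribution, where the true value is 0, so whatever remains is estimator bias. The function names, seed, and sample sizes are arbitrary choices for illustration, and this toy is not the Inception-feature FID itself.

```python
import random
import statistics


def frechet_1d(xs, ys):
    """Squared Frechet distance between Gaussians fitted to two 1-D samples."""
    dm = statistics.fmean(xs) - statistics.fmean(ys)
    ds = statistics.pstdev(xs) - statistics.pstdev(ys)
    return dm * dm + ds * ds


def average_estimate(n, trials, rng):
    """Average the estimate over several trials; both samples come from N(0, 1)."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += frechet_1d(xs, ys)
    return total / trials


rng = random.Random(0)
small = average_estimate(50, 50, rng)     # few samples per set
large = average_estimate(1000, 50, rng)   # many samples per set
```

Both averages are positive even though the true distance is 0, and the one computed from fewer samples is larger, which is why scores are only comparable at a fixed sample size.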

Answer 11 · 2018-06-07T09:10:14.000Z

@takerum I have now implemented DCGAN + SN in TensorFlow.
The training parameters are:

batch_size=16; softplus function for standard loss; learn_rate=0.002; sn=True; not resnet; n_critic=5;
beta2=0.9; iterations = 100,000; dataset=cifar10; the architecture is the same as Table 3(a)

I got an Inception score of 7.06±0.08, which is lower than the 7.42±0.08 in Table 2 of your paper.
Can you tell me the reason?

Answer 12 · 2018-06-07T09:18:14.000Z

So I would like to ask some questions about the experiment behind the SN-GAN Inception score on CIFAR-10 (7.42±0.08):

(1) What is the total number of iterations?
(2) Do you use learning rate decay?
(3) Which weight initialization method do you use?

Answer 13 · 2018-06-07T11:38:49.000Z

Please look at https://github.com/pfnet-research/chainer-gan-lib.
This repository includes code that reproduces SN-GANs on the CIFAR-10 dataset.

Answer 14 · 2018-06-07T12:30:25.000Z

@takerum Using the default parameters?

def main():
    parser = argparse.ArgumentParser(description='Train script')
    parser.add_argument('--algorithm', '-a', type=str, default="dcgan", help='GAN algorithm')
    parser.add_argument('--architecture', type=str, default="dcgan", help='Network architecture')
    parser.add_argument('--batchsize', type=int, default=64)
    parser.add_argument('--max_iter', type=int, default=100000)
    parser.add_argument('--gpu', '-g', type=int, default=0, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result', help='Directory to output the result')
    parser.add_argument('--snapshot_interval', type=int, default=10000, help='Interval of snapshot')
    parser.add_argument('--evaluation_interval', type=int, default=10000, help='Interval of evaluation')
    parser.add_argument('--display_interval', type=int, default=100, help='Interval of displaying log to console')
    parser.add_argument('--n_dis', type=int, default=5, help='number of discriminator update per generator update')
    parser.add_argument('--gamma', type=float, default=0.5, help='hyperparameter gamma')
    parser.add_argument('--lam', type=float, default=10, help='gradient penalty')
    parser.add_argument('--adam_alpha', type=float, default=0.0002, help='alpha in Adam optimizer')
    parser.add_argument('--adam_beta1', type=float, default=0.0, help='beta1 in Adam optimizer')
    parser.add_argument('--adam_beta2', type=float, default=0.9, help='beta2 in Adam optimizer')
    parser.add_argument('--output_dim', type=int, default=256, help='output dimension of the discriminator (for cramer GAN)')

Answer 15 · 2018-06-07T12:36:58.000Z

Yes, we used the default parameters other than those we specified here: https://github.com/pfnet-research/chainer-gan-lib/blob/master/example.sh#L8

Answer 16 · 2018-07-19T12:24:47.000Z

@zhangqianhui Thanks for the suggestion! Should we include all of the changes below? Or which one is the most important for making WGAN work with spectral norm?

(1) D network, using spectral norm, but remove fully_connect layers
(2) Using RMSprop instead of Adam.
(3) Add a regu term for D loss(proposed by pg-gan) to keep the output values from drifting too far away from zero

Answer 17 · 2018-07-19T12:26:51.000Z

Sorry, I am not sure. You can try.

Answer 18 · 2018-07-23T02:30:49.000Z

@takerum Hello, could you help me check the implementation of the spectral norm function below?

def spectral_norm(w, iteration= 1):

    w_shape = w.shape.as_list()
    w = tf.reshape(w, [-1, w_shape[-1]])
    # w = tf.reshape(w, [1, w.shape.as_list()[0] * w.shape.as_list()[1]])

    u = tf.get_variable("u", [1, w.shape.as_list()[-1]], initializer=tf.truncated_normal_initializer(), trainable=False)
    u_hat = u
    v_hat = None

    for i in range(iteration):

        """
        power iteration
        Usually iteration = 1 will be enough
        """
        v_ = tf.matmul(u_hat, tf.transpose(w))
        v_hat = _l2normalize(v_)
        u_ = tf.matmul(v_hat, w)
        u_hat = _l2normalize(u_)

    #real_sn = tf.svd(w, compute_uv=False)[...,0]
    sigma = tf.matmul(tf.matmul(v_hat, w), tf.transpose(u_hat))
    w_norm = w / sigma
    #Get the real spectral norm
    #real_sn_after = tf.svd(w_norm, compute_uv=False)[..., 0]

    #frobenius norm
    #f_norm = tf.norm(w, ord='fro', axis=[0, 1])

    #tf.summary.scalar("real_sn", real_sn)
    tf.summary.scalar("powder_sigma", tf.reduce_mean(sigma))
    #tf.summary.scalar("real_sn_afterln", real_sn_after)
    #tf.summary.scalar("f_norm", f_norm)

    with tf.control_dependencies([u.assign(u_hat)]):
        w_norm = tf.reshape(w_norm, w_shape)

    return w_norm

Answer 19 · 2018-07-23T02:32:18.000Z

I cannot find the problem, but it is hard to get the same scores using the default hyper-parameters that you mentioned above.

Answer 20 · 2018-07-23T04:45:19.000Z

@zhangqianhui I think you should try adding u_hat = tf.stop_gradient(u_hat) and v_hat = tf.stop_gradient(v_hat) so that no gradient flows from v_hat and u_hat back to w. In the code below from @takerum, _v and _u carry no gradient to W, but sigma does.

My implementation: https://github.com/LynnHo/GAN-Techniques-Tensorflow/blob/master/tflib/layers/layers.py
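
For readers who want to see the estimate in isolation, here is a framework-free sketch of the power iteration used for spectral normalization. It works on plain Python lists, so there is no autodiff and therefore no stop-gradient to show; in TensorFlow the `u` and `v` vectors would be wrapped in `tf.stop_gradient` as suggested above. The helper names and the toy matrix are made up for this example.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def matvec(mat, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def spectral_norm_estimate(w, u, n_iters=1):
    """Return (sigma, u) after n_iters >= 1 power-iteration steps on matrix w."""
    w_t = transpose(w)
    for _ in range(n_iters):
        v = l2_normalize(matvec(w_t, u))  # v <- W^T u / ||W^T u||
        u = l2_normalize(matvec(w, v))    # u <- W v / ||W v||
    sigma = sum(a * b for a, b in zip(u, matvec(w, v)))  # sigma = u^T W v
    return sigma, u


w = [[3.0, 0.0], [0.0, 1.0]]  # largest singular value is 3
sigma, u = spectral_norm_estimate(w, u=[1.0, 1.0], n_iters=20)
w_normalized = [[x / sigma for x in row] for row in w]
```

In training code a single iteration per step is usually enough because `u` is carried over between steps; the 20 iterations here only make the toy example converge from a cold start.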

Answer 21 · 2018-07-23T05:02:28.000Z

@zhangqianhui This is my SN implementation in TF.

def sn(W, collections=None, seed=None, return_norm=False, name='sn'):
    shape = W.get_shape().as_list()
    if len(shape) == 1:
        sigma = tf.reduce_max(tf.abs(W))
    else:
        if len(shape) == 4:
            _W = tf.reshape(W, (-1, shape[3]))
            shape = (shape[0] * shape[1] * shape[2], shape[3])
        else:
            _W = W
        u = tf.get_variable(
            name=name + "_u",
            shape=(FLAGS.num_sn_samples, shape[0]),
            initializer=tf.random_normal_initializer,
            collections=collections,
            trainable=False
        )

        _u = u
        for _ in range(FLAGS.Ip_sn):
            _v = tf.nn.l2_normalize(tf.matmul(_u, _W), 1)
            _u = tf.nn.l2_normalize(tf.matmul(_v, tf.transpose(_W)), 1)
        _u = tf.stop_gradient(_u)
        _v = tf.stop_gradient(_v)
        sigma = tf.reduce_mean(tf.reduce_sum(_u * tf.transpose(tf.matmul(_W, tf.transpose(_v))), 1))
        update_u_op = tf.assign(u, _u)
        with tf.control_dependencies([update_u_op]):
            sigma = tf.identity(sigma)

    if return_norm:
        return W / sigma, sigma
    else:
        return W / sigma

Answer 22 · 2018-07-23T06:35:14.000Z

@LynnHo @takerum thanks

Answer 23 · 2018-08-03T11:17:36.000Z

@zhangqianhui
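Hello, are you using the WGAN loss $\mathbb{E}_{x \sim q_{\mathrm{data}}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$ together with spectral normalization in the discriminator? Does that work?
I previously tried adding spectral normalization directly to WGAN-GP, and the losses of both the generator and the discriminator became NaN.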

Answer 24 · 2018-08-03T12:00:33.000Z

@IPNUISTlegal
It works in my experiments, but I could not get higher Inception scores than SN-GAN.

Answer 25 · 2018-08-04T02:07:36.000Z

@zhangqianhui
@takerum

I applied hinge loss and spectral normalization in my experiment, which actually outperforms the WGAN-GP loss function with the same network structure.
It is a pity, though, that the discriminator loss does not converge to a stable value and swings wildly up and down!
Why? This has confused me for many days.
Thanks! (#^.^#)
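
For reference, the hinge losses commonly paired with spectral normalization can be sketched in plain Python as below. The function names and the toy scores are assumptions for illustration, not code from this thread. Both terms of the D loss depend on the current generator, so its value tracks a moving target and need not decrease steadily.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: penalize real scores below +1 and fake scores above -1."""
    real_term = mean([max(0.0, 1.0 - s) for s in real_scores])
    fake_term = mean([max(0.0, 1.0 + s) for s in fake_scores])
    return real_term + fake_term


def g_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's score on generated samples."""
    return -mean(fake_scores)


d_loss = d_hinge_loss([2.0, 0.5], [-2.0, 0.0])
g_loss = g_hinge_loss([-2.0, 0.0])
```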

Answer 26 · 2018-08-05T02:17:17.000Z

Is that the D loss curve when using hinge loss?

Answer 27 · 2018-08-05T05:59:38.000Z

@zhangqianhui
Yes, it is the D loss using hinge loss, and it does not converge to a stable value.
Why?
Thanks.

Answer 28 · 2018-08-05T09:17:10.000Z

@IPNUISTlegal I think it is a normal curve. How is the quality of the generated samples?

Answer 29 · 2018-08-06T01:28:54.000Z

@zhangqianhui
Combining hinge loss with SN in the discriminator network slightly outperforms the WGAN-GP loss function with the same network structure.
Why is it a normal curve?
As far as I know, the D loss should decrease as the number of iterations increases, just like the G loss.
I am new to GANs. Thanks!

Answer 30 · 2018-08-06T12:49:23.000Z

I am not sure I understand your problem. You can send your WeChat ID to my Gmail (zhang163220@gmail.com).

Answer 31 · 2018-08-10T01:25:15.000Z

@takerum Hello, I ran your code with these parameters and got an Inception score of 7.24944, lower than 7.50. Is that expected? Did you run it many times and report the highest score?

The parameters:

def main():
    parser = argparse.ArgumentParser(description='Train script')
    parser.add_argument('--algorithm', '-a', type=str, default="stdgan", help='GAN algorithm')
    parser.add_argument('--architecture', type=str, default="sndcgan", help='Network architecture')
    parser.add_argument('--batchsize', type=int, default=64)
    parser.add_argument('--max_iter', type=int, default=100000)
    parser.add_argument('--gpu', '-g', type=int, default=0, help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result', help='Directory to output the result')
    parser.add_argument('--snapshot_interval', type=int, default=10000, help='Interval of snapshot')
    parser.add_argument('--evaluation_interval', type=int, default=10000, help='Interval of evaluation')
    parser.add_argument('--display_interval', type=int, default=100, help='Interval of displaying log to console')
    parser.add_argument('--n_dis', type=int, default=5, help='number of discriminator update per generator update')
    parser.add_argument('--gamma', type=float, default=0.5, help='hyperparameter gamma')
    parser.add_argument('--lam', type=float, default=10, help='gradient penalty')
    parser.add_argument('--adam_alpha', type=float, default=0.0002, help='alpha in Adam optimizer')
    parser.add_argument('--adam_beta1', type=float, default=0.0, help='beta1 in Adam optimizer')
    parser.add_argument('--adam_beta2', type=float, default=0.9, help='beta2 in Adam optimizer')
    parser.add_argument('--output_dim', type=int, default=256, help='output dimension of the discriminator (for cramer GAN)')

The scores:

0.926269    1.18002     7.24944         0.108181       25.3894     
     total [##################################################] 100.00%
this epoch [..................................................]  0.00%
    100000 iter, 640 epoch / 100000 iterations
      1.33 iters/sec. Estimated time to finish: 0:00:00.

Answer 32 · 2018-08-10T05:58:09.000Z

I ran it only once and reported that result.
The hyper-parameters that achieve 7.5 are alpha=0.0002, beta1=0.5, beta2=0.999 (setting C in the paper), which seem to be different from the ones you set.
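
One way to express that setting with the flags from the listing above is to override the Adam defaults at parse time. The sketch below rebuilds only the three optimizer flags from that listing. Passing an explicit argument list to `parse_args` is just a way to keep the example self-contained; it does not reproduce the repository's actual training script.

```python
import argparse

# Only the optimizer flags from the listing above; all other flags are omitted.
parser = argparse.ArgumentParser(description='Train script')
parser.add_argument('--adam_alpha', type=float, default=0.0002, help='alpha in Adam optimizer')
parser.add_argument('--adam_beta1', type=float, default=0.0, help='beta1 in Adam optimizer')
parser.add_argument('--adam_beta2', type=float, default=0.9, help='beta2 in Adam optimizer')

# Setting C as quoted above: alpha=0.0002, beta1=0.5, beta2=0.999.
args = parser.parse_args(['--adam_beta1', '0.5', '--adam_beta2', '0.999'])
```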

Answer 33 · 2018-08-10T06:49:57.000Z

ok, I will try again.

Answer 34 · 2018-08-28T02:09:11.000Z

@takerum Hello, when training SN-GAN on the CIFAR-10 dataset, do you use only the CIFAR-10 training data? I found that training on both the training and test data gives higher Inception scores.

Answer 35 · 2018-08-28T06:00:47.000Z

Yes, only the training data is used for the training.

Answer 36 · 2018-08-31T03:46:28.000Z

@takerum What is max_iter on the STL dataset in your paper?

Answer 37 · 2018-08-31T03:55:51.000Z

What does 'max_iter' refer to? The number of iterations?

Answer 38 · 2018-08-31T04:36:54.000Z

Yes.

Answer 39 · 2018-11-08T06:12:42.000Z

Hi, is the code combining the WGAN loss and spectral norm available now?

Answer 40 · 2018-11-14T13:24:18.000Z

@HuaqiangWei Sorry, no. I think this combination makes no sense.

Answer 41 · 2018-11-14T13:34:39.000Z

@HuaqiangWei Sorry, no. I think this combination makes no sense.

Thank you for your answer. But according to the paper, isn't spectral normalization a substitute for the gradient penalty in WGAN-GP?

Answer 42 · 2018-11-22T10:44:09.000Z

@HuaqiangWei Sorry, no. I think this combination makes no sense.

Hi, in "SN for GANs" it is shown that there are improvements when SN is combined with WGAN-GP. Since SN enforces the 1-Lipschitz constraint, why, theoretically speaking, does SN with vanilla WGAN make no sense? @zhangqianhui could you please explain?

Answer 43 · 2018-11-22T12:02:16.000Z

It is hard to say. Yes, you can use SN to enforce the constraint, but in my experiments SN with vanilla WGAN struggles to generate samples of higher quality than vanilla GAN with SN.

Answer 44 · 2018-11-22T12:39:18.000Z

It is hard to say. Yes, you can use SN to enforce the constraint, but in my experiments SN with vanilla WGAN struggles to generate samples of higher quality than vanilla GAN with SN.

Yeah, I have actually been trying SN + vanilla WGAN for a month, with much frustration. The results are generally worse than WGAN-GP.

Maybe @takerum has a theoretical explanation?

Answer 45 · 2018-12-03T02:22:32.000Z

@zhangqianhui
I don't quite understand what 'combine WGAN and spectral norm' means.
I think the D (or critic) is currently already approximating the Wasserstein distance between p_r and p_g, isn't it?

I only glanced at the SN code, and I assume that it uses neither a cross-entropy loss nor a sigmoid at the end of D.
If so, I think SN-GAN is optimizing the Wasserstein distance, the same as WGAN.

Answer 46 · 2018-12-03T09:31:53.000Z

@zhangqianhui
I don't quite understand what 'combine WGAN and spectral norm' means.
I think the D (or critic) is currently already approximating the Wasserstein distance between p_r and p_g, isn't it?

I only glanced at the SN code, and I assume that it uses neither a cross-entropy loss nor a sigmoid at the end of D.
If so, I think SN-GAN is optimizing the Wasserstein distance, the same as WGAN.

In the original paper, they did not use the vanilla WGAN loss + SN (when they did, it performed worse). What they tried is WGAN + GP + SN.

Answer 47 · 2018-12-25T09:53:04.000Z

@zhangqianhui This is my SN implementation in TF.

def sn(W, collections=None, seed=None, return_norm=False, name='sn'):
    shape = W.get_shape().as_list()
    if len(shape) == 1:
        sigma = tf.reduce_max(tf.abs(W))
    else:
        if len(shape) == 4:
            _W = tf.reshape(W, (-1, shape[3]))
            shape = (shape[0] * shape[1] * shape[2], shape[3])
        else:
            _W = W
        u = tf.get_variable(
            name=name + "_u",
            shape=(FLAGS.num_sn_samples, shape[0]),
            initializer=tf.random_normal_initializer,
            collections=collections,
            trainable=False
        )

        _u = u
        for _ in range(FLAGS.Ip_sn):
            _v = tf.nn.l2_normalize(tf.matmul(_u, _W), 1)
            _u = tf.nn.l2_normalize(tf.matmul(_v, tf.transpose(_W)), 1)
        _u = tf.stop_gradient(_u)
        _v = tf.stop_gradient(_v)
        sigma = tf.reduce_mean(tf.reduce_sum(_u * tf.transpose(tf.matmul(_W, tf.transpose(_v))), 1))
        update_u_op = tf.assign(u, _u)
        with tf.control_dependencies([update_u_op]):
            sigma = tf.identity(sigma)

    if return_norm:
        return W / sigma, sigma
    else:
        return W / sigma

@takerum
Hi, I have a question about the code.
_W.shape = (shape[0] * shape[1] * shape[2], shape[3])
_u.shape = (FLAGS.num_sn_samples, shape[0])
tf.matmul(_u, _W)
With these shapes, this product cannot be computed.
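
A quick shape trace may help here. In the quoted function, `shape` is reassigned to the flattened 2-D shape before `u` is created, so `shape[0]` in the definition of `u` refers to the flattened row count rather than the kernel height. The sketch below follows shapes only; the kernel size and `num_sn_samples = 1` are example values, and the helper `matmul_shape` is hypothetical.

```python
def matmul_shape(a, b):
    """Shape of a @ b for 2-D shapes, raising if the inner dimensions differ."""
    if a[1] != b[0]:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (a[0], b[1])


kernel_shape = (3, 3, 64, 128)  # example conv kernel: (h, w, c_in, c_out)
num_sn_samples = 1              # example value for FLAGS.num_sn_samples

# Mirrors the reshape and the reassignment of `shape` in the quoted code.
w_shape = (kernel_shape[0] * kernel_shape[1] * kernel_shape[2], kernel_shape[3])
u_shape = (num_sn_samples, w_shape[0])  # shape[0] is now h * w * c_in

v_shape = matmul_shape(u_shape, w_shape)             # tf.matmul(_u, _W)
u_next_shape = matmul_shape(v_shape, w_shape[::-1])  # tf.matmul(_v, tf.transpose(_W))
```

Under this reading the inner dimensions agree in both products; a mismatch would only appear if `u` were built from the original 4-D kernel shape.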

Answer 48 · 2019-01-19T12:06:55.000Z

Just put spectral norm in the generator too and it works. No gradient penalty needed.

Answer 49 · 2019-03-12T11:12:11.000Z

@dougsouza
What kind of CNN architecture do you use?

Answer 50 · 2019-03-12T16:48:44.000Z

@w86763777, I used the same network as SNGAN + Self Attention. I did not pursue much, but noticed that using SN on G helps to stabilize training. It is also a good idea to get rid of fully connected layers and use a smaller learning rate. Even though training is "stable", results are not very good and some spikes in the loss sometimes lead to collapse.

Answer 51 · 2019-07-08T02:34:15.000Z

Hi all, I have a feeling that equation 7 in this paper (spectral normalization) requires a stronger necessary condition than the one used in the gradient penalty paper (Proposition 1 in Improved Training of Wasserstein GANs). Equation 7 only makes sense when equation 1 holds, which might not be the case for more complex generator models. Perhaps this is the reason why spectral norm cannot simply be used to replace the gradient penalty?

Answer 52 · 2021-07-19T04:23:21.000Z

@IPNUISTlegal
It works in my experiments, but I could not get higher Inception scores than SN-GAN.

I happened to see your post here. I have never had success with the WGAN loss plus SN. Did you obtain your results with other tricks, such as the regularization of the D loss (which is similar to GP) that you mentioned above?

Answer 53 · 2021-11-21T11:46:38.000Z

I am also interested in whether anyone has achieved good results with just SN on WGAN.

I see it mentioned in the literature and in blog posts as a good alternative to GP, but I am unable to find any good code example where the discriminator/critic converges nicely.