jeonggg119/DL_paper

[CV_GAN] Generative Adversarial Nets

GAN : Generative Adversarial Nets
https://jeonggg119.tistory.com/37

Abstract

  • Estimating Generative models via an Adversarial process
  • Simultaneously training two models (minimax two-player game)
    • Generative model G : capturing data distribution → recovering training data distribution
    • Discriminative model D : estimating probability that a sample came from training data rather than G → equal to 1/2 everywhere at the unique solution
  • G and D are defined by multilayer perceptrons & trained with backprop

1. Introduction

  • The promise of DL : to discover models that represent probability distributions over many kinds of data
  • The most striking success in DL : Discriminative models that map a high dimensional, rich sensory input to a class label
    • based on backprop and dropout
    • using piecewise linear units with well-behaved gradients
  • Deep Generative model : less impact due to
    • difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation
    • difficulty of leveraging benefits of piecewise linear units
  • GAN : training both models using only backprop and dropout & sampling from G using only forward prop
    • Generative model G : generating samples by passing random noise through a multilayer perceptron
    • Discriminative model D : also defined by a multilayer perceptron
    • No need for Markov chains or inference networks

2. Related work

  • RBMs(restricted Boltzmann machines), DBMs(deep Boltzmann machines) : undirected graphical models with latent variables
  • DBNs(Deep belief networks) : hybrid models containing a single undirected layer and several directed layers
  • Score matching, NCE(noise-contrastive estimation) : criteria that don't approximate or bound log-likelihood
  • GSN(generative stochastic network) : extends generalized denoising auto-encoders (DAE) → trains a generative machine to draw samples from the desired distribution

3. Adversarial nets

1) Adversarial modeling (G+D) based on MLPs

  • p_g : G's distribution
  • p_z(z) : prior on input noise variables z
  • G(z; θ_g) : differentiable function represented by MLP → maps z to data space → output : generated (fake) sample
  • D(x; θ_d) : probability that x came from the training data rather than p_g → output : single scalar
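
A minimal sketch of the two networks defined above, in plain Python: G maps a noise vector z to a point in data space, and D maps a point to a single probability. The layer sizes, the weight initialization, and the helper names (`init_layer`, `linear`, `generator`, `discriminator`) are illustrative assumptions, and ReLU is used in D only to keep the example short.

```python
import math
import random

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, v):
    """Affine map W v + b."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, t) for t in v]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generator(params, z):
    """G(z): noise vector -> sample in data space (each value in (0, 1))."""
    hidden = relu(linear(params[0], z))
    return [sigmoid(t) for t in linear(params[1], hidden)]

def discriminator(params, x):
    """D(x): sample -> single scalar, the estimated probability that x is real."""
    hidden = relu(linear(params[0], x))
    return sigmoid(linear(params[1], hidden)[0])

rng = random.Random(0)
NOISE_DIM, HIDDEN_DIM, DATA_DIM = 2, 8, 4  # hypothetical sizes
g_params = [init_layer(rng, NOISE_DIM, HIDDEN_DIM), init_layer(rng, HIDDEN_DIM, DATA_DIM)]
d_params = [init_layer(rng, DATA_DIM, HIDDEN_DIM), init_layer(rng, HIDDEN_DIM, 1)]

z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]  # z ~ p_z
fake = generator(g_params, z)                        # one sample from p_g
print(fake, discriminator(d_params, fake))
```

Sampling from G needs only this forward pass, which is why no Markov chain or inference network is involved.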

2) Two-player minimax game with value function V(G,D)

min_G max_D V(G,D) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1-D(G(z)))]

  • D : maximize the probability of assigning the correct label to both training examples and samples from G
    • Ideal for D : D(x) → 1, D(G(z)) → 0
  • G : minimize log(1-D(G(z)))
    • Ideal for G : D(G(z)) → 1
    • Implementation : early in learning D rejects G's samples with high confidence, so log(1-D(G(z))) saturates; train G to maximize log(D(G(z))) instead for stronger gradients
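
The saturation argument can be checked numerically. The sketch below assumes D ends in a sigmoid, D = sigmoid(a), and compares the gradient of each generator objective with respect to the logit a; the function names are illustrative.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the minimax objective that G minimizes."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of log(sigmoid(a)): the alternative objective that G maximizes."""
    return 1.0 - sigmoid(a)

# A very negative logit means D confidently rejects the generated sample.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(f"logit={a:+.1f}  D(G(z))={sigmoid(a):.4f}  "
          f"|saturating|={abs(grad_saturating(a)):.4f}  "
          f"|non-saturating|={abs(grad_non_saturating(a)):.4f}")
```

When D(G(z)) is close to 0 the first gradient nearly vanishes while the second stays close to 1, which is the "stronger gradients early in learning" effect.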

3) Theoretical Analysis

  • Training criterion allows one to recover data generating distribution as G and D are given enough capacity

  • [Algorithm 1] k steps of optimizing D and 1 step of optimizing G
    • D : being maintained near its optimal solution
    • G : changing slowly enough
  • Loss function for G : min log(1-D(G(z))) => max log(D(G(z))) for stronger gradients early in training
  • D is trained to discriminate samples from data, converging to D*(x)=P_data(x)/(P_data(x)+P_g(x))
    • Once D has reached the optimal state for its objective, G is trained to achieve its own objective
  • ∴ P_g(x) = P_data(x) <=> D*(x)=1/2 for every x (D can no longer distinguish real from generated samples)
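
The alternation in [Algorithm 1] can be sketched on a toy 1-D problem with hand-written gradients. Everything specific here is an assumption made for illustration, not the paper's setup: G(z) = a*z + b, D(x) = sigmoid(w*x + c), data drawn from N(4, 1), plain SGD, and the non-saturating generator objective.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(steps=200, k=1, batch=32, lr=0.05, seed=0):
    """Alternate k updates of D with one update of G."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (theta_g)
    w, c = 0.0, 0.0  # discriminator parameters (theta_d)
    d_updates = g_updates = 0
    for _ in range(steps):
        for _ in range(k):
            xs = [rng.gauss(4.0, 1.0) for _ in range(batch)]           # x ~ p_data
            gs = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]  # G(z), z ~ p_z
            # Gradient ascent on mean[log D(x) + log(1 - D(G(z)))].
            grad_w = grad_c = 0.0
            for x, g in zip(xs, gs):
                d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
                grad_w += (1.0 - d_real) * x - d_fake * g
                grad_c += (1.0 - d_real) - d_fake
            w += lr * grad_w / batch
            c += lr * grad_c / batch
            d_updates += 1
        # One generator step: gradient ascent on mean[log D(G(z))].
        grad_a = grad_b = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d_fake = sigmoid(w * (a * z + b) + c)
            grad_a += (1.0 - d_fake) * w * z
            grad_b += (1.0 - d_fake) * w
        a += lr * grad_a / batch
        b += lr * grad_b / batch
        g_updates += 1
    return {"a": a, "b": b, "w": w, "c": c,
            "d_updates": d_updates, "g_updates": g_updates}

result = train_toy_gan(steps=200, k=1)
print(result)
```

A linear-logistic D can only separate the two distributions by location, so this toy mainly pushes the generator offset b toward the data; it shows the update schedule, not a useful model.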

4. Theoretical Results

  • G implicitly defines P_g as distribution of the samples G(z) obtained when z~P_z
  • Goal : [Algorithm 1] should converge to a good estimator of P_data, given enough training data and training time
  • Non-parametric setting : representing a model with infinite capacity by studying convergence in the space of probability density functions
  • The minimax game has a global optimum at p_g = p_data

4.1 Global Optimality of p_g = p_data

[Proposition 1]

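  • Optimal D for any given G
  • For G fixed, optimal D is D*(x)=P_data(x)/(P_data(x)+P_g(x))
    • Reason : for each x, V(G,D) contributes P_data(x)·log(y) + P_g(x)·log(1-y) with y = D(x), which is maximized on [0,1] at y = P_data(x)/(P_data(x)+P_g(x))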
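
A quick numerical check of [Proposition 1] at a single point x: brute-force the y = D(x) that maximizes p_data·log(y) + p_g·log(1-y) and compare it with the closed form. The density values 0.3 and 0.1 and the function names are arbitrary choices for illustration.

```python
import math

def pointwise_value(y, p_data, p_g):
    """Contribution of one x to V(G, D), where y = D(x)."""
    return p_data * math.log(y) + p_g * math.log(1.0 - y)

def best_d_on_grid(p_data, p_g, n=9999):
    """Brute-force the maximizing y on a uniform grid inside (0, 1)."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda y: pointwise_value(y, p_data, p_g))

p_data, p_g = 0.3, 0.1  # hypothetical density values at one point x
closed_form = p_data / (p_data + p_g)
numeric = best_d_on_grid(p_data, p_g)
print(closed_form, numeric)
```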

[Theorem 1]
C(G) = max_D V(G,D) = -log 4 + 2·JSD(P_data || P_g)

  • Global minimum of C(G) is achieved if and only if P_g=P_data, and its value there is -log 4
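
[Theorem 1] can be verified on small discrete distributions. The sketch evaluates C(G) directly with the optimal D* = P_data/(P_data+P_g) and compares it with -log 4 + 2·JSD(P_data || P_g); the example distributions and function names are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) = V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0.0:
            continue  # x has zero mass under both distributions
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]
print(c_of_g(p_data, p_data), -math.log(4.0))                    # equal at the optimum
print(c_of_g(p_data, p_g), -math.log(4.0) + 2.0 * jsd(p_data, p_g))
```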

4.2 Convergence of Algorithm 1

[Proposition 2]

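  • If G and D have enough capacity, and at each step of Algorithm 1 D is allowed to reach its optimum given G and P_g is updated so as to improve the criterion, then P_g converges to P_data
  • Proof sketch : consider V(G, D) = U(P_g, D) as a function of P_g; U(P_g, D) is convex in P_g
    • The subderivatives of a supremum of convex functions include the derivative at the point where the maximum is attained → equivalent to computing a gradient descent update for P_g at the optimal D given G
    • sup_D U(P_g, D) is convex in P_g with a unique global optimum → with sufficiently small updates of P_g, P_g converges to P_data
  • In practice : G(z; θ_g) represents only a limited family of P_g, and θ_g is optimized rather than P_g itself, so the proof does not strictly apply
  • Excellent performance of MLPs in practice → reasonable models to use despite their lack of theoretical guarantees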

5. Experiments

  • Datasets : MNIST, Toronto Face Database(TFD), CIFAR-10
  • G : mixture of ReLU and sigmoid activations / Noise used as input to only the bottommost layer (dropout and other noise at intermediate layers is permitted by the framework but was not used)
  • D : Maxout activations / Dropout applied during training
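
For reference, a maxout unit outputs the maximum over several learned affine pieces. A minimal single-unit sketch follows; the weights are hand-picked for illustration, not learned.

```python
def maxout(x, weights, biases):
    """Maxout unit: the maximum over the affine pieces w_j . x + b_j."""
    pieces = [sum(w * xi for w, xi in zip(w_j, x)) + b_j
              for w_j, b_j in zip(weights, biases)]
    return max(pieces)

# Two pieces, +x and -x, make the unit compute |x|.
abs_weights = [[1.0], [-1.0]]
abs_biases = [0.0, 0.0]
print(maxout([3.0], abs_weights, abs_biases))   # 3.0
print(maxout([-2.0], abs_weights, abs_biases))  # 2.0
```

With the pieces (1.0, 0) and (0.0, 0) the same unit reduces to ReLU, so maxout can be read as a learned piecewise linear activation.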

[Table 1]

  • Estimation method : fit a Gaussian Parzen window to samples generated by G and report the log-likelihood of the test data under that distribution
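
A 1-D sketch of the Parzen window estimate: place one Gaussian of width sigma on every generated sample and score test points under the resulting mixture. The function names and the scalar setting are simplifications for illustration; sigma is a free bandwidth that would normally be chosen on a validation set.

```python
import math

def parzen_log_likelihood(x, samples, sigma):
    """Log-density of scalar x under a Gaussian Parzen window built on `samples`."""
    exponents = [-((x - s) ** 2) / (2.0 * sigma ** 2) for s in samples]
    m = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_sum - math.log(len(samples)) - 0.5 * math.log(2.0 * math.pi * sigma ** 2)

def mean_test_log_likelihood(test_points, samples, sigma):
    """Average log-likelihood of the test points, the quantity reported per model."""
    return sum(parzen_log_likelihood(x, samples, sigma) for x in test_points) / len(test_points)

generated = [0.0, 0.1, -0.2, 0.3]  # hypothetical samples from G
test_set = [0.05, -0.1]            # hypothetical held-out data
print(mean_test_log_likelihood(test_set, generated, sigma=0.5))
```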

[Figure 2]

  • Rightmost column : nearest training example of the neighboring sample → Model has not memorized training set
  • Samples are fair random draws (Not cherry-picked)
  • Sampling process does not depend on Markov chain mixing → Samples are uncorrelated

[Figure 3]

  • Linear interpolation between coordinates in z space of the full model
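
The interpolation itself is a straight line between two noise vectors; each point on the path would then be passed through G to produce one image. A small sketch, where `interpolate_z` is a hypothetical helper name:

```python
def interpolate_z(z_start, z_end, num_points):
    """Evenly spaced points on the segment from z_start to z_end (endpoints included)."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    path = []
    for i in range(num_points):
        t = i / (num_points - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

print(interpolate_z([0.0, 0.0], [2.0, 4.0], 3))
# [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
```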

6. Advantages and disadvantages

1) Disadvantages

  • No explicit representation of P_g(x)
  • D must be synchronized well with G during training (G must not be trained too much without updating D)
  • Otherwise G collapses too many values of z to the same value of x to have enough diversity to model P_data ("the Helvetica scenario")

2) Advantages

(1) Computational Advantages

  • Markov chains are never needed / Only backprop is used / No Inference is needed
  • Wide variety of functions can be incorporated into model

(2) Statistical Advantages from G

  • Not being updated directly with data, but only with gradients flowing through D
  • (= Components of input are not copied directly into G's parameters)
  • Representing very sharp, even degenerate distributions

7. Conclusions and future work

  • conditional GAN p(x|c) : adding c as input to both G and D
  • Learned approximate inference : training auxiliary network to predict z given x
    • Similar to inference net trained by wake-sleep algorithm
    • Advantage : inference net trained for a fixed G after G has finished training
  • All conditionals p(x_S | x_{not S}) : S is a subset of indices of x; modeled approximately by training a family of conditional models that share params
    • Can be used to implement a stochastic extension of the deterministic MP-DBM
  • Semi-supervised learning : when limited labeled data is available
  • Efficiency improvements : training could be accelerated by better methods for coordinating G and D or by better distributions to sample z from during training
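
For the conditional GAN item above, one simple way to "add c as input to both G and D" is to append a one-hot encoding of c to each network's input. Concatenation and the helper names below are assumptions for illustration; the notes only state that c is added as an input.

```python
def one_hot(index, num_classes):
    """One-hot encoding of a class label c."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def generator_input(z, c, num_classes):
    """Input to a conditional G: noise z followed by the encoded condition c."""
    return list(z) + one_hot(c, num_classes)

def discriminator_input(x, c, num_classes):
    """Input to a conditional D: sample x followed by the same condition c."""
    return list(x) + one_hot(c, num_classes)

print(generator_input([0.5, -0.5], 1, 3))
# [0.5, -0.5, 0.0, 1.0, 0.0]
```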