[CV_GAN] Generative Adversarial Nets
GAN : Generative Adversarial Nets
https://jeonggg119.tistory.com/37
Abstract
- Estimating Generative models via an Adversarial process
- Simultaneously training two models (minimax two-player game)
- Generative model G : captures the data distribution → learns to recover the training data distribution
- Discriminative model D : estimates the probability that a sample came from the training data rather than from G → equals 1/2 everywhere at the optimum
- G and D are defined by multilayer perceptrons & trained with backprop
1. Introduction
- The promise of DL : to discover models that represent probability distributions over many kinds of data
- The most striking success in DL : Discriminative models that map a high dimensional, rich sensory input to a class label
- based on backprop and dropout
- using piecewise linear units, which have well-behaved gradients
- Deep generative models : less impact so far, due to...
- difficulty of approximating the many intractable probabilistic computations that arise in maximum likelihood estimation
- difficulty of leveraging the benefits of piecewise linear units in the generative context
- GAN : training both models using only backprop and dropout & sampling from G using only forward prop
- Generative model G : generating samples by passing random noise through a multilayer perceptron
- Discriminative model D : also defined by a multilayer perceptron
- No need for Markov chains or inference networks
2. Related work
- RBMs(restricted Boltzmann machines), DBMs(deep Boltzmann machines) : undirected graphical models with latent variables
- DBNs(Deep belief networks) : hybrid models containing a single undirected layer and several directed layers
- Score matching, NCE(noise-contrastive estimation) : criteria that don't approximate or bound log-likelihood
- GSN(generative stochastic network) : extends generalized denoising auto-encoders → trains G to draw samples from the desired distribution
3. Adversarial nets
1) Adversarial modeling (G+D) based on MLPs
- p_g : G's distribution
- p_z(z) : Input noise random variables
- G : differentiable function represented by an MLP → G(z) : mapping from noise to data space → output : a fake image
- D(x) : probability that x came from the training data rather than from p_g → output : a single scalar
2) Two-player minimax game with value function V(G,D) (written out below)
- D : maximize probability of assigning correct label to Training examples & Samples from G
- D(x)=1, D(G(z))=0
- G : minimize log(1-D(G(z)))
- D(G(z))=1
- Implementation trick : train G to maximize log(D(G(z))) instead → stronger gradients early in learning (prevents saturation of log(1-D(G(z))) when D easily rejects G's samples)
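For reference, the value function from Eq. 1 of the paper:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```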
3) Theoretical Analysis
- Training criterion allows one to recover the data-generating distribution, provided G and D are given enough capacity
- [Algorithm 1] k steps of optimizing D and 1 step of optimizing G
- D : being maintained near its optimal solution
- G : changing slowly enough
- Loss function for G : min log(1-D(G(z))) => max log(D(G(z))) for stronger gradients early in training (sketched in code below)
- D is trained to discriminate samples from data, converging to D*(x) = P_data(x) / (P_data(x) + P_g(x))
- Once D reaches its optimal state for the current G, G is trained to meet its own objective against that optimal D
- ∴ P_g(x) = P_data(x) ⟺ D*(x) = 1/2
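The training procedure just described (Algorithm 1: k discriminator steps per generator step, plus the non-saturating max log(D(G(z))) generator loss) can be sketched as below. This is a minimal illustration, not the paper's exact setup: the layer sizes, uniform noise, LeakyReLU in D (the paper used maxout), and the optimizer settings are assumptions.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 100, 784                 # assumed: 100-dim noise, flattened MNIST-sized data

G = nn.Sequential(                      # generator G(z): noise -> fake sample
    nn.Linear(z_dim, 256), nn.ReLU(),
    nn.Linear(256, x_dim), nn.Sigmoid())

D = nn.Sequential(                      # discriminator D(x): sample -> P(x is real)
    nn.Linear(x_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)
opt_D = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)
bce = nn.BCELoss()
k = 1                                   # discriminator updates per generator update

def train_step(real_x):                 # real_x: (batch, x_dim) minibatch from P_data
    batch = real_x.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # k steps on D: ascend log D(x) + log(1 - D(G(z)))
    for _ in range(k):
        z = torch.rand(batch, z_dim) * 2 - 1
        fake_x = G(z).detach()          # do not backprop into G here
        loss_D = bce(D(real_x), ones) + bce(D(fake_x), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 1 step on G: ascend log D(G(z)) (non-saturating loss)
    z = torch.rand(batch, z_dim) * 2 - 1
    loss_G = bce(D(G(z)), ones)         # equals -mean log D(G(z))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```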
4. Theoretical Results
- G implicitly defines P_g as distribution of the samples G(z) obtained when z~P_z
- [Algorithm 1] to converge to a good estimator of P_data
- Non-parametric : representing a model with infinite capacity by studying convergence in the space of probability density functions
- Global optimum for p_g = p_data
4.1 Global Optimality of p_g = p_data
- Optimal D for any given G
- For G fixed, the optimal D is D*(x) = P_data(x) / (P_data(x) + P_g(x))
- The global minimum of C(G), equal to -log 4, is achieved if and only if P_g = P_data
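Both statements follow from maximizing V pointwise in D (for a, b ≥ 0, a·log y + b·log(1−y) is maximized at y = a/(a+b)) and then rewriting C(G) with the Jensen–Shannon divergence, as in the paper:

```latex
V(G, D) = \int_x \Big[ p_{\mathrm{data}}(x)\log D(x) + p_g(x)\log\big(1 - D(x)\big) \Big]\,dx
\;\Rightarrow\;
D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}

C(G) = \max_D V(G, D) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\mathrm{data}} \,\|\, p_g\right) \ge -\log 4,
\qquad \text{with equality iff } p_g = p_{\mathrm{data}}
```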
4.2 Convergence of Algorithm 1
[Proposition 2]
- If G and D have enough capacity, and at each step of Algorithm 1,
- D is allowed to reach optimum given G & P_g is updated to improve criterion → P_g = P_data
- pf) V(G, D) = U(P_g, D) is convex in P_g (key fact stated below)
- A gradient-descent update for P_g is computed at the optimal D for the current G
- With sufficiently small updates of P_g, P_g converges to P_data
- In practice, θ_g is optimized rather than P_g itself, so the proof does not strictly apply
- Excellent performance of MLPs in practice → reasonable models to use despite the lack of theoretical guarantees
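The convexity fact used in the proof of Proposition 2, as stated in the paper:

```latex
f(x) = \sup_{\alpha} f_\alpha(x), \quad f_\alpha \text{ convex in } x \text{ for every } \alpha
\;\Rightarrow\;
\partial f_{\beta}(x) \subseteq \partial f(x) \text{ for } \beta = \arg\sup_{\alpha} f_\alpha(x)
```

So a gradient step on P_g computed at the optimal D is a subgradient step on the convex criterion sup_D U(P_g, D), and sufficiently small steps converge to its unique global optimum P_g = P_data.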
5. Experiments
- Datasets : MNIST, Toronto Face Database(TFD), CIFAR-10
- G : mixture of ReLU and sigmoid activations / framework permits dropout and other noise at intermediate layers, but in experiments noise was used only as input to the bottommost layer
- D : Maxout activations / Dropout
- Estimation method : Gaussian Parzen window-based log-likelihood estimation — a Parzen window is fit to samples drawn from G (with σ chosen by cross-validation) and the log-likelihood of test data under it is reported (see the sketch after this list)
- Rightmost column : nearest neighboring training sample → Model has not memorized training set
- Samples are fair random draws (Not cherry-picked)
- No Markov chain mixing in the sampling process → samples are uncorrelated
- Linear interpolation between coordinates in the z space of the full model
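A minimal sketch of the Parzen-window evaluation mentioned in this list, assuming NumPy/SciPy, samples already drawn from G, and a bandwidth σ chosen by cross-validation on a validation set (the names here are illustrative, not from the paper):

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, gen_samples, sigma):
    """Mean log-likelihood of test_x under an isotropic Gaussian Parzen window
    centered on samples drawn from G.
    test_x: (n, d) array, gen_samples: (m, d) array, sigma: kernel bandwidth."""
    n, d = test_x.shape
    m = gen_samples.shape[0]
    diffs = test_x[:, None, :] - gen_samples[None, :, :]        # (n, m, d)
    sq_dist = np.sum(diffs ** 2, axis=-1) / (2.0 * sigma ** 2)  # (n, m)
    log_norm = d * np.log(sigma * np.sqrt(2.0 * np.pi))         # Gaussian normalizer
    log_p = logsumexp(-sq_dist, axis=1) - np.log(m) - log_norm  # (n,) log-densities
    return log_p.mean()
```

In practice the test points would be processed in batches to keep the (n, m, d) difference tensor small; the paper notes this estimator has somewhat high variance and does not perform well in high-dimensional spaces, but was the best method available.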
6. Advantages and disadvantages
1) Disadvantages
- No explicit representation of P_g(x)
- D must be kept well synchronized with G during training (in particular, G must not be trained too much without updating D)
- Otherwise "the Helvetica scenario" : G collapses too many values of z to the same value of x and loses the diversity needed to model P_data
2) Advantages
(1) Computational Advantages
- Markov chains are never needed / Only backprop is used / No Inference is needed
- Wide variety of functions can be incorporated into model
(2) Statistical Advantages from G
- Not being updated directly with data, but only with gradients flowing through D
- (= Components of input are not copied directly into G's parameters)
- Representing very sharp, even degenerate distributions
7. Conclusions and future work
- conditional GAN p(x|c) : adding c as input to both G and D (see the sketch at the end of these notes)
- Learned approximate inference : training auxiliary network to predict z given x
- Similar to inference net trained by wake-sleep algorithm
- Advantage : inference net trained for a fixed G after G has finished training
- All conditionals p(x_S | x_∖S) : S is a subset of the indices of x; approximated by training a family of conditional models that share parameters
- Can be used to implement a stochastic extension of the deterministic MP-DBM
- Semi-supervised learning : features from D could improve classifier performance when limited labeled data is available
- Efficiency improvements : training accelerated by better methods for coordinating G and D, or by better distributions from which to sample z during training
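As a rough illustration of the conditional-GAN item above: a one-hot label c is simply concatenated with G's noise input and with D's data input. Everything here (sizes, one-hot encoding, layers) is an assumption for illustration, not the paper's method.

```python
import torch
import torch.nn as nn

z_dim, c_dim, x_dim = 100, 10, 784   # assumed sizes: noise, one-hot label, flattened image

# Conditional generator for p(x|c): the label c is concatenated with the noise z
G = nn.Sequential(nn.Linear(z_dim + c_dim, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Sigmoid())

# Conditional discriminator: also receives c, concatenated with the (real or fake) sample
D = nn.Sequential(nn.Linear(x_dim + c_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.rand(16, z_dim) * 2 - 1                      # a batch of noise vectors
c = torch.eye(c_dim)[torch.randint(0, c_dim, (16,))]   # random one-hot labels
fake_x = G(torch.cat([z, c], dim=1))                   # samples from p(x|c)
score  = D(torch.cat([fake_x, c], dim=1))              # D judges the (x, c) pair
```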