casperkaae/parmesan

Reproduce results from sec. 6.1 in "Variational inference using normalizing flows"

casperkaae opened this issue · 23 comments

As discussed in #21 it would be nice to reproduce the results from sec. 6.1 in the "Variational inference using normalizing flows" paper by Rezende et al.

I would guess the approach is:

  • sample input: z ~ q0 = N(z; mu, sigma^2 I)
  • run it through the mapping and the normalizing flow to get U_hat
  • update the parameters using the squared error between U and U_hat

I was thinking something similar

  • Generate input to the network from the initial distribution: z0 ~ q0(z) = N(z; mu, sigma^2 I). However, where mu and sigma^2 should come from wasn't completely clear to me. The typical VAE way of estimating them from data with an MLP looks tricky, because then we'd first have to sample the target distribution. Maybe they can just be fixed by hand to something "reasonable" by looking at the target distribution; if the NF is powerful enough, this shouldn't matter too much anyway.
  • Run the data through the normalizing flow layers and obtain the transformed distribution log q_K(z_K) = log q_0(z_0) - sum_{k=1}^K log |1 + u_k^T \psi_k(z_{k-1})| (a quick sketch of one flow step follows after this list).
  • Train the model by minimizing the KL-divergence between the transformed distribution q_K(z) and the target distribution q_true(z) = e^{-U(z)}/Z, where U(z) is one of the energy functions defined in table 1 of the paper. As the parameters being optimized do not affect q_true(z), we can safely ignore the partition function Z, if I'm not mistaken.
    So the final loss would be something like loss = KL[qhat_true(z) || q_K(z)] = E_{zK ~ qK(z)}[log qhat_true(zK) - log qK(zK)] = E_{z0 ~ q0(z)}[-U(zK) - log Z - log q0(z0) + sum_k logdet J_k]
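
A minimal NumPy sketch of one planar-flow step and its log-det term (illustrative only, not the actual Parmesan layer, and ignoring the invertibility constraint on u from the paper):

import numpy as np

def planar_step(z, u, w, b):
    """One planar transform f(z) = z + u * tanh(w^T z + b) and its log|det J|.

    z: (N, D) batch of samples; u, w: (D,); b: scalar.
    """
    a = z.dot(w) + b                            # (N,)
    f_z = z + np.outer(np.tanh(a), u)           # (N, D)
    psi = np.outer(1.0 - np.tanh(a) ** 2, w)    # psi(z) = h'(a) * w, shape (N, D)
    logdet = np.log(np.abs(1.0 + psi.dot(u)))   # log|1 + u^T psi(z)|, shape (N,)
    return f_z, logdet

# chaining K such steps gives:
#   log q_K(z_K) = log q_0(z_0) - sum_k logdet_k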

Is there any reason to use squared error instead?

I agree with KL instead of squared error. Either we can just set q0 to a standard normal, or we could fit the mean/sigma to U before optimizing the flow parameters?

I think q_true(z) = e^{-U(z)}/Z should be p_true(z) = e^{-U(z)}/Z and the loss:
KL[phat_true(z) || qK(z)] = E_{zK ~ qK(z)}[log phat_true(zK) - log qK(zK)]
= E_{z0 ~ q0(z)}[-U(zK) - log Z - log q0(z0) + sum_k logdet J_k]
∝ E_{z0 ~ q0(z)}[-U(zK) - log q0(z0) + sum_k logdet J_k]

Yes, I think that makes sense. Maybe a good way would be to have eps ~ N(0, I) and then z0 = mu + exp(0.5*log_var)*eps and jointly optimize {mu, log_var} together with the NF parameters using the same KL-divergence loss as above.
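
Roughly something like this for the sampling part (a NumPy sketch; all names are illustrative):

import numpy as np

rng = np.random.RandomState(0)
N, D = 1000, 2                         # batch size and latent dimensionality (illustrative)

mu = np.zeros(D)                       # learnable, optimized jointly with the flow params
log_var = np.zeros(D)                  # learnable

eps = rng.normal(size=(N, D))          # eps ~ N(0, I)
z0 = mu + np.exp(0.5 * log_var) * eps  # z0 ~ N(mu, diag(exp(log_var)))

# log q0(z0) for the diagonal Gaussian, needed later in the KL loss
log_q0_z0 = -0.5 * np.sum(np.log(2 * np.pi) + log_var + eps ** 2, axis=1)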

I didn't get the part about the sign changing, but I agree p(z) is a better name than q_true(z).

Are you up for testing it? Otherwise I'll try it out at some point, however that might not be until next week or something like that.

I think in the above equations the KL term was swapped and it should be KL[qK(z)||p(z)]..

However, even with this change I didn't have much luck. The initial distribution (standard normal) just gets compressed to one of the sides (which side depends a bit on optimizer, initialization, etc.).

[image: nf_planar1]

I also tried a simpler case where p(z) is a normal distribution with diagonal covariance and some manually set mean and variance, and I applied the flow f(z) = mu + sigma*z..
This does seem to work, except that I had some Theano problems getting the transformed distribution out, and could only get the transformed samples out..
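
As a sanity check of that simpler affine case, the change-of-variables formula can be verified directly against the closed-form Gaussian density (plain NumPy, independent of the Theano code):

import numpy as np

rng = np.random.RandomState(0)
N, D = 5, 2
mu = np.array([1.0, -2.0])             # manually set mean
sigma = np.array([0.5, 2.0])           # manually set std

z0 = rng.normal(size=(N, D))                           # z0 ~ N(0, I)
z1 = mu + sigma * z0                                   # affine "flow" f(z) = mu + sigma*z
log_q0 = -0.5 * np.sum(np.log(2 * np.pi) + z0 ** 2, axis=1)
logdet = np.sum(np.log(np.abs(sigma)))                 # Jacobian of the affine map is diag(sigma)
log_q1 = log_q0 - logdet                               # change of variables

# closed-form log density of N(mu, diag(sigma^2)) at z1; should match log_q1
log_ref = -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + ((z1 - mu) / sigma) ** 2, axis=1)
assert np.allclose(log_q1, log_ref)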

[image: nf_linear1]

I wonder if maybe there is some issue with the NormalizingPlanarFlowLayer after all..

Ok, thanks for the effort :)! So either we can't figure out how they ran the experiments, or there is a bug in the NormalizingPlanarFlowLayer code.

Can you share the code you used to run the experiments with me?

The code is here
https://gist.github.com/wuaalb/c5b85d0c257d44b0d98a

Maybe one interesting experiment would be to apply two planar transformations f(z) = z + u h(w^T z + b) and manually figure out some settings of {u, w, b} (there's a geometric interpretation of the transform) that "split" a spherical distribution into a bimodal distribution like in figure 1 of the paper. Then try learning that and seeing what goes wrong.
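
For instance, something along these lines; the parameter values below are just a guess at a "splitting" configuration, not taken from the paper:

import numpy as np
import matplotlib.pyplot as plt

def planar(z, u, w, b):
    # f(z) = z + u * tanh(w^T z + b)
    return z + np.outer(np.tanh(z.dot(w) + b), u)

rng = np.random.RandomState(0)
z = rng.normal(size=(5000, 2))         # spherical q0 samples

# w points along the first axis, tanh saturates quickly because of the large w,
# and u pushes saturated points apart: mass with z_1 > 0 moves right, mass with
# z_1 < 0 moves left, which splits the distribution into two modes.
for _ in range(2):
    z = planar(z, u=np.array([2.0, 0.0]), w=np.array([5.0, 0.0]), b=0.0)

plt.scatter(z[:, 0], z[:, 1], s=2)
plt.show()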

Thanks for the code. Here are a few comments.
I think you need to subtract p(zK) = N(zK | 0, I) from the loss, i.e. eq. (20) in the paper? https://gist.github.com/wuaalb/c5b85d0c257d44b0d98a#file-nf6_1-py-L174

If I do that with K=32 I get
[image: figure_1]
i.e. one of the modes comes out correctly.

Sounds interesting.. but I'm not completely sure what you mean; could you post the exact change to the loss?

I ran some more experiments, and fixed the 3rd plot (so the 3rd and 4th plots should be identical).
With nflows=8 I get something similar to yours
[image: _nf2_nflows 8_winit normal 0]

However, if I initialize with w=lasagne.init.Normal(mean=[1.0, 0.0]), it is able to split the distribution into a bimodal one.
[image: _nf2_nflows 8_winit 1 0]

So I think this at least means that NormalizingPlanarFlowLayer outputs the correct f(z) and log det |J|. If it has a problem, it is probably in how the constraints are applied or in how the parameters are initialized, or (perhaps more likely) there is something wrong with the loss or the optimization.

"I think you need to subtract p(zK) = N(zK | 0,I) from the loss i.e. eq (20) in the paper": Yes that was a bit cryptic :). I just meant that log p(x,zK) = log p(zK) + log p(x|zK) = log p(zK) + U_z(zK). I think your code is missing log p(zK)?

Your second plot looks very much like the paper - that's great. Maybe they just didn't completely specify how they initialized the params. If I add log p(zK), use K=8 and w=lasagne.init.Normal(mean=0.0, std=1.0), I get the plot below and a loss of ≈2.15

[image: figure_1]

Hmm.. but doesn't the equation you are referencing only make sense in the standard VAE setting, where the observed data x is generated by a random process involving unobserved latent variables z?

Here there is no observed data x.. We are just trying to get our approximate distribution q_K(z) to match the target distribution p(z) by minimizing the KL-divergence between the two..

The KL-divergence uses p(z_K) in place of p(x, z_K) in eq. 20. Keeping in mind that the energy function U(z) relates to probability like p(z) = 1/Z e^{-U(z)}, we get log p(z_K) = -U(z_K) - log Z (the normalization constant log Z term is omitted from the loss function as we do not know it and it doesn't affect the optimization).
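
So the objective, written out explicitly, is roughly the following (pseudo-NumPy sketch; in the actual gist this is a Theano expression so it can be differentiated):

import numpy as np

def kl_loss(zK, log_q0_z0, sum_logdet_J, U):
    """Monte Carlo estimate of KL[q_K || p] up to the constant log Z.

    log q_K(z_K) = log q_0(z_0) - sum_k logdet_k   and   log p(z_K) = -U(z_K) - log Z,
    so KL[q_K || p] = E_{q_K}[log q_K(z_K) - log p(z_K)]
                    ≈ mean(log q_0(z_0) - sum_logdet_J + U(z_K)) + const.
    """
    return np.mean(log_q0_z0 - sum_logdet_J + U(zK))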

[image: _nf2_nflows 32_winit uniform -1 1 _uinit normal mean 0 std 1 _rmspropmom_annealed_250epochs_edit]

Getting a little closer, but it seems quite fiddly..

Sorry for the late response.

It looks really good. If you make a short example I'll be very happy to include it in the examples section. I hope to get around to testing the norm-flow implementation more thoroughly on MNIST soon, but your results seem to indicate that it is working.

I wrote an email to Rezende about this and he kindly confirmed that this is in fact a pretty tricky optimization problem and that the parameters should be initialized by drawing from a normal distribution with small variance (ie. the transforms start out close to the identity map).
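
In Lasagne terms that would be something along the lines of the snippet below (illustrative only, not necessarily the exact settings I ended up using):

import lasagne

# small-variance inits so each planar transform f(z) = z + u*h(w^T z + b)
# starts out close to the identity map (the u*h(.) term is then near zero)
u_init = lasagne.init.Normal(std=1e-2, mean=0.0)
w_init = lasagne.init.Normal(std=1e-2, mean=0.0)
b_init = lasagne.init.Constant(0.0)
# these are then passed to the flow layer's parameter arguments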

Unfortunately, I still haven't been able to find a single configuration that works well for all the problems, but for some it is OK (although not quite as good as the plots in the paper), e.g.
http://gfycat.com/ReasonableVibrantDove

I'm a little short on time right now, but once I get a working example I don't mind contributing it to Parmesan.

If you try the MNIST example and you can afford to do multiple runs, I'd be interested in knowing if initializing the b parameter to const(0) vs. drawing from normal(std=1e-2) makes a difference (and also initializing {u,w,b} with normal(std=1e-2) vs. normal(std=1e-3)).

Hello, I have been working on reproducing the work in this paper as well. I found that, in both the synthetic examples and on MNIST, increasing the variance of the distribution from which parameter initializations are drawn was very helpful. For example, try Uniform(-1.5, 1.5).

Annealing was also quite helpful for the synthetic cases, and I have found some evidence that it is also helpful for MNIST. I also found iterative training helpful for MNIST (i.e. successively add each flow layer throughout training), though it wasn't helpful for the synthetic examples.
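
For reference, one simple form of annealing is an inverse temperature on the energy term that ramps up to 1 during training (a sketch of the idea; the exact schedule is a free choice):

import numpy as np

def beta(step, warmup_steps=10000, beta0=0.01):
    """Inverse-temperature schedule: ramps linearly from beta0 to 1."""
    return min(1.0, beta0 + (1.0 - beta0) * step / warmup_steps)

def annealed_loss(log_q0_z0, sum_logdet_J, U_zK, step):
    # early in training the energy term is down-weighted, so the flow is not
    # immediately pulled into a single mode of the target
    return np.mean(log_q0_z0 - sum_logdet_J + beta(step) * U_zK)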

I would be interested to hear how any of this works for you.

@wuaalb would it be possible for you to share your implementation that produced the above gfycat with me?

@yberol
Yes, sure, I uploaded it here https://gist.github.com/wuaalb/af959f5c7114cb8c9c458633d009fd14
Not 100% sure the current settings produce that gif.. If not, the correct settings are probably somewhere in there, but commented out.

@wuaalb Thank you very much, really appreciate your help!

@wuaalb Thanks again for sharing your implementation, it guided me a lot as I re-implemented it using autograd. However, I have a question.

When I plot the histogram of the samples z_K, everything is as expected. You are also plotting q_K(z_K) on the uniform grid. To compute q_K(z_K) on the grid, wouldn't I need to follow the inverse flow and get z_0, z_1, ..., z_{K-1} that produced z_K? Right now, I compute z_K's when z_0 is the uniform grid, but then the z_K's are warped and the plot does not look very nice. I would appreciate it if you can clarify how you computed q_K(z_K) for me.

[image: 2beans]
[image: 2sines]

It's been like a year and a half, so I don't remember any of the details..

To me your plots look more or less OK, just with a white background and maybe some scaling issue (?).

I think in my code example this is the relevant code
https://gist.github.com/wuaalb/af959f5c7114cb8c9c458633d009fd14#file-nf6_1-py-L234-L246

This is to avoid the white background

cmap = matplotlib.cm.get_cmap(None)
ax.set_axis_bgcolor(cmap(0.))

I think q_K(z_K) is computed like

log_qK_zK = log_q0_z0 - sum_logdet_J
qK_zK = T.exp(log_qK_zK)
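
If I read the question right, there is no need to invert the flow: the density is evaluated at the pushed-forward grid points and plotted at their (warped) locations, roughly like this (a NumPy/matplotlib sketch; flow() is just a placeholder for the trained model):

import numpy as np
import matplotlib.pyplot as plt

# regular grid in z0-space
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
z0 = np.column_stack([xx.ravel(), yy.ravel()])

log_q0_z0 = -0.5 * np.sum(np.log(2 * np.pi) + z0 ** 2, axis=1)   # q0 = N(0, I)

# push the grid through the trained flow; flow() is a placeholder that should
# return z_K and the accumulated sum of log|det J_k| for each grid point
zK, sum_logdet_J = flow(z0)

qK_zK = np.exp(log_q0_z0 - sum_logdet_J)

# plot the density at the *warped* locations z_K; the grid ending up warped is expected
plt.scatter(zK[:, 0], zK[:, 1], c=qK_zK, s=2)
plt.show()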

Hi,

How many epochs do you need to train for each U(z) distribution?

In my experiments, after 15 epochs of 10000 steps each, I still don't get meaningful samples.

Regards,
Wei

@justinmaojones, how do you train for MNIST if you don't know the target distribution? Would you be so kind as to share the (pseudo) code?

vr308 commented

Hi @yberol did you manage to figure out how to get around the warping issue of the transformed grid?