gugarosa/learnergy

a method for getting samples of synthesized visible units

bhomass opened this issue · 9 comments

Pre-checkings

  • [x] Check that you are up-to-date with the master branch of Learnergy. You can update with:
    pip install git+git://github.com/gugarosa/learnergy.git --upgrade --no-deps

  • [x] Check that you have read all of our README.

Description

I looked and could not find an existing method for sampling visible units from a trained Gaussian RBM model. The procedure should be to start with a random input dataset fed into the visible layer, alternate hidden and visible sampling until convergence, and then take the final visible unit values as the synthetic data.
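The procedure described above can be sketched roughly as follows. This is only an illustration in plain NumPy, not Learnergy's gibbs_sampling(): the function name sample_from_noise, the unit-variance Gaussian visible units, the Bernoulli hidden units, and the random untrained parameters W, a, b are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_from_noise(W, a, b, n_samples=16, n_steps=100, seed=0):
    """Run a block-Gibbs chain that starts from random visible states.

    W: (n_visible, n_hidden) weights, a: visible biases, b: hidden biases.
    Assumes Gaussian visible units with unit variance and Bernoulli hidden units.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_samples, W.shape[0]))  # random starting point
    mean_v = v
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + b)                          # p(h = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)   # sample hidden states
        mean_v = h @ W.T + a                              # mean of p(v | h)
        v = mean_v + rng.standard_normal(mean_v.shape)    # sample visible states
    # Return the noise-free conditional mean of the last step.
    return mean_v

# Hypothetical, untrained parameters, used only to exercise the function.
param_rng = np.random.default_rng(1)
W = 0.01 * param_rng.standard_normal((784, 128))
a = np.zeros(784)
b = np.zeros(128)
samples = sample_from_noise(W, a, b, n_samples=4, n_steps=10)
print(samples.shape)
```

With trained parameters in place of the random ones, each row of samples would be one synthetic 28x28 image flattened to 784 values.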

Since I could not find such a method, I wrote my own and tested it on a single class of the MNIST dataset. The results are quite awful: you can barely make out a 4, and it is very distorted and noisy. The training process does appear to have converged after about 100 epochs, and I tried a number of values for the number of iterations in the synthetic sampling run.

My questions:

  1. Does such a method already exist, and I simply did not find it?
  2. Has this test been done before using the MNIST dataset, and did it yield good synthetic results?
  3. Any feeling for what could be wrong with my procedure? Do I need to add dropout?

Steps to Reproduce

  1. Train on one class of the MNIST dataset (I used 4) for multiple epochs until convergence.
  2. Write a sampling routine that starts with random values for the visible layer and then iteratively samples between the hidden and visible layers, for 10 to 400 iterations.
  3. Visualize the resulting image; the result is a very distorted 4.
  4. Next, instead of sampling from random input, simply call the reconstruct() method; the reconstructed image is also poor. Could this mean that training simply isn't complete, or that the model is unable to capture the distribution with high quality?
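One way to look at the question in step 4 is to compare a one-pass reconstruction error on training data and on held-out data. The sketch below is a mean-field up-down pass written in NumPy under the assumption of Gaussian visible and Bernoulli hidden units; reconstruction_mse is a hypothetical helper, not Learnergy's reconstruct(), and the arrays are placeholder data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_mse(v, W, a, b):
    """Mean squared error of one deterministic visible -> hidden -> visible pass."""
    p_h = sigmoid(v @ W + b)   # hidden activation probabilities
    v_rec = p_h @ W.T + a      # mean of the Gaussian visible units
    return float(np.mean((v - v_rec) ** 2))

# Placeholder data and untrained parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
train = rng.standard_normal((32, 784))
test = rng.standard_normal((32, 784))
W = 0.01 * rng.standard_normal((784, 128))
a = np.zeros(784)
b = np.zeros(128)

print("train MSE:", reconstruction_mse(train, W, a, b))
print("test MSE: ", reconstruction_mse(test, W, a, b))
```

A large gap between the two numbers would point toward overfitting, while a high value on both would suggest the model has not captured the data well in the first place.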

I tried out GaussianReluRBM. It converges really fast to a small MSE. However, when trying to obtain samples from the trained model, I get what seems like random noise. The reconstruction MSE is also much higher than that on the training dataset. What is the proper method for obtaining a reasonable set of samples from this model? I have not been able to find any reference to previous work on the proper way to sample from a model with a ReLU hidden layer.

After adding edropout to GaussianRBM, the reconstruction MSE on a test dataset is now very close to that on the training dataset. There is still substantial degradation when sampling from the trained model, and the training cycle is excruciatingly slow.

Hello @bhomass,
Sorry for the delay in responding to your questions, let's smash this down!

First of all, the procedure to generate new samples is the standard gibbs_sampling(). As you correctly pointed out, we must alternate between the hidden and visible layers until the Markov chain reaches a stable state; however, starting from random noise is very difficult for our models, since our lib only has the CD-k procedure. In theory, CD-k reaches equilibrium very quickly during training, but in practice it does not lead to a good generative process from random input, since the model is not trained for this purpose and we cannot compute the exact partition function needed to write the probability density function over the (v, h) states.
To better understand what I said, I recommend the paper "Quickly Generating Representative Samples from an RBM-Derived Process", in which the authors explain this in more depth: "And so although the standard sampling method for RBMs is to run a Gibbs chain, it is never actually run to equilibrium except in the most trivial of cases". Given that, to improve the generative performance, we need to implement Persistent Contrastive Divergence (PCD), or another sampling technique that allows us to "remember how the training data was".

Regarding the GaussianReluRBM, this model has the same drawback with CD-k, with an even stronger negative effect, since this type of hidden unit is more prone to "memorizing the input" if the training parameters are not chosen to give smooth (rather than fast) convergence. This model is well suited to being fine-tuned in a supervised manner for classification purposes. For further improvements, we should also implement a sparsity penalty. Look at this paper: Rectified Linear Units Improve Restricted Boltzmann Machines, by Nair and Hinton.

Finally, dropout regularization may improve learning in RBMs since it does not turn off many neurons; however, as you pointed out, edropout is more expensive than regular dropout due to its several energy calculations. Regarding data reconstruction from the same distribution (test data), edropout wins in our experiments and may be useful in some tasks such as denoising, or data reconstruction from corrupted/incomplete samples. If you are interested in testing such tasks, please report back to us :D

So, I hope this answer helps you better understand our library and its limitations. I will also share a few additional, personal wishes:
  • I really hope to have PCD in our sampling routine;
  • I am very interested in developing an RBM able to sample from noise. I have spent a lot of time thinking about this, but unfortunately my time to test different approaches and write some math is a little bit scarce nowadays.

If you write some additional code that can improve the field and our lib, feel free to create a branch and contact us; it would be amazing to have more people working with us.

Best regards,
Mateus.

I have gone through the rates-FPCD method in the paper you referenced. I am hopeful it will solve the sampling problem I am facing. There is a lingering concern about whether this method works for real-valued inputs, since all the formulas assume Bernoulli inputs. Probably a more pertinent point is whether it is implied that, during sampling, you are expected to use the original data to start the MC chain instead of random initialization. If so, it really kills the use case of deriving synthetic data from the model that is not in close proximity to the training data.

May I ask why you did not implement the FPCD method of training and opted for edropout instead?

Hello Bruce,

Good understanding and good questions. You forced me to reread my old notes to answer you, haha. Let's start.

From a high-level perspective, PCD applies only to the training step, to force our model to learn the equilibrium state. We initialize an auxiliary variable (e.g., self.particles=torch.randn(batch, visible_shape)) to hold the negative particles (p(v'|h)) that are generated when we run a Gibbs chain, p(h|v) and p(v'|h), during the training phase. At each batch iteration, we save the "negative visible batch", self.particles=p(v'|h), and start from it (visible=self.particles) on the next iteration, unlike CD-1, where we start from "visible=samples" at every batch iteration. This simple modification forces the model to learn the equilibrium state, capturing the true data distribution p(v) (in theory).
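To make the difference from CD-1 concrete, here is a minimal sketch of a PCD update. It is written in NumPy rather than torch, it is not code from the lib, and the class name TinyPCD, the unit-variance Gaussian visible units, and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyPCD:
    """Toy Gaussian-Bernoulli RBM trained with persistent negative particles."""

    def __init__(self, n_visible, n_hidden, batch_size, lr=0.01, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases
        self.lr = lr
        # Persistent particles: created once and kept across batch iterations.
        self.particles = self.rng.standard_normal((batch_size, n_visible))

    def _hidden(self, v):
        p = sigmoid(v @ self.W + self.b)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def _visible(self, h):
        mean = h @ self.W.T + self.a
        return mean + self.rng.standard_normal(mean.shape)

    def step(self, batch, k=1):
        # Positive phase: driven by the data batch.
        pos_p, _ = self._hidden(batch)
        # Negative phase: continue the persistent chain.
        # (CD-k would set v_neg = batch here instead.)
        v_neg = self.particles
        for _ in range(k):
            _, h = self._hidden(v_neg)
            v_neg = self._visible(h)
        neg_p, _ = self._hidden(v_neg)
        # Gradient estimate: data statistics minus model statistics.
        self.W += self.lr * (batch.T @ pos_p / len(batch) - v_neg.T @ neg_p / len(v_neg))
        self.a += self.lr * (batch.mean(axis=0) - v_neg.mean(axis=0))
        self.b += self.lr * (pos_p.mean(axis=0) - neg_p.mean(axis=0))
        # Save the chain state so the next batch starts from it.
        self.particles = v_neg

# Placeholder data, used only to exercise the update.
data_rng = np.random.default_rng(1)
batch = data_rng.standard_normal((8, 20))
rbm = TinyPCD(n_visible=20, n_hidden=5, batch_size=8)
start = rbm.particles.copy()
for _ in range(3):
    rbm.step(batch)
```

The only line that separates this from CD-k is the initialization of v_neg in the negative phase; everything else in the update is shared.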

Given that, after training the model we can simply pass random noise as the visible samples, let the model infer p(h|v), and sample from it in the Gibbs chain. After some Gibbs steps, the generated data (p(v|h)) will be close to the training data (in theory), since the model reached equilibrium in the training phase.

Answering your question, we did not implement PCD for the learning procedure since it was not necessary for our academic applications. When I was starting out with coding and learning about RBMs, I wrote a piece of code outside the lib, but at that time I think my understanding/code was wrong, and I paused the development. So, I think it is a good time to revive this code/research.
Also, edropout was just an academic work developed as an alternative to the original dropout, without PCD in mind, since at that time it was not necessary to generate samples from random input, only to show that edropout is better for reconstruction purposes.

I hope I was clear in the explanations.

Best regards,
Mateus.

Mateus

I think we lost the train of thought in this thread.

This all started when I tested out your sampling routine and got very poor results. This led you to point to the paper by Breuleux that describes the rates-FPCD method. When I read that paper, I found the author very vague about whether he uses random numbers to start the chain. Mind you, there is no known published code for this. I am just wondering whether it is worth pursuing. If the whole idea is to use existing training data to start the chain, this would be a moot point. I can only start with random input in my experiment.

Hello, Bruce.

Well, unfortunately, I must agree that some of the old-school papers omit some decent pieces of information sometimes.

Regarding your question of whether it is promising to keep investigating sample generation from random input, my answer is yes. It can be hard to achieve, but it is possible. As we frequently see with CD-k, it is theoretically possible to generate good samples from random input, but in practice it fails at such a task, and PCD may overcome that, judging by the authors' experiments.

About your experiments: given a trained RBM (trained without random data as input), you need to sample from a random input to generate valid synthetic data, is that right? If I understood correctly, you can employ PCD to train your model and then use the random input to start a Gibbs chain that yields your synthetic samples.

Concerning our implementation, I would like to implement the PCD as soon as I can, and I think that will be "easy" given the RBM classes created.

Please let me know if I misunderstood any points.
Best,
Mateus.

I understood all that you said, but I am not able to get a good fantasy particle from a trained Gaussian-Bernoulli model. This is what I got:

[image: bad_4]

Are you able to do much better?

Interesting...
I am working on a convolutional implementation, but at this point the results are worse than yours.
Hope to be back with good news soon.