
The goal of this project is to synthesize images using a $\beta$-variational autoencoder ($\beta$-VAE). The autoencoder is trained on the MNIST data set; after training, the decoder generates images of digits from random noise sampled from a normal distribution.

What is the $\beta$-Variational Autoencoder?

$\beta$-VAE is a deep unsupervised generative approach, a variant of the Variational Autoencoder (VAE) for disentangled factor learning, that can discover independent latent factors of variation in unlabeled data. A disentangled representation can be defined as one where single latent units of $z$ are sensitive to changes in single generative factors of $X$, while being relatively invariant to changes in other factors. Such a representation is therefore factorised and often interpretable: different independent latent units learn to encode different independent ground-truth generative factors of variation in the data.

The difference between $\beta$-VAE and the original VAE is a Lagrange multiplier $\beta$ applied to the KL-divergence term of the VAE objective. The objective function of $\beta$-VAE is:

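$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta \, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$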

$\beta$-VAE attempts to learn a disentangled representation of conditionally independent data generative factors by heavily penalizing the KL divergence between the prior and the approximate posterior, using a hyperparameter $\beta > 1$. This constraint limits the capacity of $z$, which, combined with the pressure to maximise the log likelihood of the training data $X$, encourages the model to learn the most efficient representation of the data. We assume that the data $x$ has some conditionally independent ground-truth generative factors, and the KL-divergence term of the $\beta$-VAE objective encourages conditional independence in $q_\phi(z|x)$. Hence, higher values of $\beta$ should encourage learning a disentangled representation. The extra pressure from high $\beta$ values, however, may create a trade-off between reconstruction fidelity and the quality of disentanglement in the learnt latent representation. Disentangled representations emerge when the right balance is found between information preservation (reconstruction fidelity) and latent channel capacity restriction ($\beta > 1$).
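
As a rough illustration of the objective described above, the following NumPy sketch computes the negative $\beta$-VAE objective (the quantity one would minimize) for a batch. It assumes a Bernoulli (binary cross-entropy) reconstruction term, a diagonal-Gaussian approximate posterior parameterized by `mu` and `logvar`, and a standard normal prior. The function name `beta_vae_loss`, the value `beta=4.0`, and the random stand-in data are illustrative assumptions and are not taken from this project's code.

```python
import numpy as np


def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-VAE objective, averaged over the batch.

    x, x_recon : arrays of shape (batch, pixels) with values in [0, 1]
    mu, logvar : arrays of shape (batch, latent_dim) describing q(z|x)
    beta       : weight on the KL term (illustrative default, assumed)
    """
    eps = 1e-7
    x_recon = np.clip(x_recon, eps, 1.0 - eps)  # avoid log(0)

    # Bernoulli reconstruction term: -log p(x|z), summed over pixels.
    recon = -np.sum(
        x * np.log(x_recon) + (1.0 - x) * np.log(1.0 - x_recon), axis=1
    )

    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)

    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)


# Random stand-in data with MNIST-like shapes (784 pixels, 20 latent dims).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 784)).astype(np.float64)
x_recon = rng.uniform(0.05, 0.95, size=(8, 784))
mu = rng.normal(size=(8, 20))
logvar = rng.normal(scale=0.1, size=(8, 20))

total, recon, kl = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
print("total:", total, "reconstruction:", recon, "KL:", kl)
```

Setting `beta=1.0` in this sketch gives the standard VAE loss; larger values weight the KL term more heavily, which is the source of the trade-off with reconstruction fidelity discussed above.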


  • Validation image at epoch 0 during training (ground-truth image followed by generated image).

  • Validation image at epoch 100 during training (ground-truth image followed by generated image).

  • Interpolation between two images by walking in latent space from image 1 to image 2 in 16 steps.

  • Clusters of the means of the latent distributions corresponding to the input data points. The means are compressed from 20 dimensions to 2 dimensions for visualization. Note how the means are scattered randomly at epoch 0 and cluster together at epoch 10.
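
The latent-space walk and the 2-D view of the latent means listed above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the interpolation is taken to be linear with both endpoints included in the 16 steps, and the 20-to-2 dimension reduction uses PCA because the method used in this project is not stated here. The helper names `interpolate_latents` and `project_to_2d` and the random stand-in vectors are hypothetical.

```python
import numpy as np


def interpolate_latents(z_start, z_end, steps=16):
    """Return `steps` latent vectors evenly spaced on the straight line
    from z_start to z_end (endpoints included; linear walk is assumed)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # shape (steps, 1)
    return (1.0 - t) * z_start[None, :] + t * z_end[None, :]


def project_to_2d(means):
    """Project latent means of shape (n, d) to (n, 2) with PCA,
    computed from the SVD of the mean-centered data (PCA is assumed)."""
    centered = means - means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # coordinates along the top-2 components


# Random stand-ins for encoder outputs with a 20-dimensional latent space.
rng = np.random.default_rng(0)
z1 = rng.normal(size=20)  # latent code of image 1
z2 = rng.normal(size=20)  # latent code of image 2

path = interpolate_latents(z1, z2, steps=16)
means = rng.normal(size=(100, 20))
points = project_to_2d(means)
print(path.shape, points.shape)
```

In a trained model, each row of `path` would be passed through the decoder to produce one frame of the interpolation, and `points` would be scatter-plotted and colored by digit label.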



Python 2.0 or above


License: MIT

Copyright (c) Feb 2023 Pradip Kathiriya