Variational Autoencoder with Arbitrary Conditioning (VAEAC) is a neural probabilistic model based on the variational autoencoder. It can be conditioned on an arbitrary subset of observed features and then sample the remaining features.
For more detail, see the following paper:
Oleg Ivanov, Michael Figurnov, Dmitry Vetrov.
Variational Autoencoder with Arbitrary Conditioning, ICLR 2019,
link.
This PyTorch code implements the model and reproduces the results from the paper.
Install prerequisites from requirements.txt.
This code was tested on Linux (but it should work on Windows as well),
Python 3.6.4 and PyTorch 1.0.
To run experiments with CelebA, download the dataset into some directory,
unzip img_align_celeba.zip, and set celeba_root_dir in datasets.py
so that it points to the root of the unzipped folder.
To impute missing features with VAEAC, one can use impute.py.
impute.py works with real-valued and categorical features.
It takes tab-separated values (tsv) file as an input.
NaNs in the input file indicate the missing features.
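As a rough illustration of the input layout, the sketch below writes a small table with missing entries as tab-separated values. The helper write_tsv is hypothetical and not part of this repository. It also assumes that impute.py accepts the literal nan as the NaN marker and expects no header row, so check the script before relying on it.

```python
import csv
import io
import math


def write_tsv(rows, handle):
    """Write numeric rows as tab-separated values; None or NaN becomes 'nan'."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        cells = []
        for value in row:
            missing = value is None or math.isnan(value)
            # Assumption: the literal 'nan' is how a missing feature is spelled.
            cells.append("nan" if missing else str(value))
        writer.writerow(cells)


# Two objects with three features each; None marks a missing feature.
rows = [[1, 0.5, None], [0, None, 3.0]]
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

To produce a real file, pass an open file handle (for example one opened with newline="") instead of the in-memory buffer.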
The output file is also a tsv file, where for each object
there are num_imputations copies of it with NaNs replaced
with different imputations.
These copies with imputations are consecutive in the output file.
For example, if num_imputations is 2,
then the output file is structured as follows:
object1_imputation1
object1_imputation2
object2_imputation1
object2_imputation2
object3_imputation1
...
By default, num_imputations is 5.
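Because the copies of each object are consecutive, the output rows can be regrouped per object with simple slicing. The following is a minimal sketch under that assumption. The function names group_imputations and mean_imputation are hypothetical, and averaging only makes sense for real-valued features.

```python
def group_imputations(rows, num_imputations):
    """Split consecutive output rows into one group of copies per object."""
    if num_imputations <= 0 or len(rows) % num_imputations != 0:
        raise ValueError("row count must be a multiple of num_imputations")
    return [
        rows[start:start + num_imputations]
        for start in range(0, len(rows), num_imputations)
    ]


def mean_imputation(group):
    """Average the copies of one object column by column (real-valued only)."""
    return [sum(column) / len(column) for column in zip(*group)]


# Output rows for two objects with num_imputations = 2.
rows = [
    [1.0, 2.0],  # object1_imputation1
    [1.0, 4.0],  # object1_imputation2
    [5.0, 0.0],  # object2_imputation1
    [7.0, 0.0],  # object2_imputation2
]
groups = group_imputations(rows, 2)
means = [mean_imputation(group) for group in groups]
```

For categorical features, a majority vote over the copies would be a more natural aggregate than the mean.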
One-hot max size is the number of distinct values of a categorical feature. The values are assumed to be integers from 0 to K - 1, where K is the one-hot max size. For a real-valued feature, the one-hot max size is assumed to be 0 or 1.
For example, for a dataset with a binary feature, three real-valued features
and a categorical feature with 10 classes the correct --one_hot_max_sizes
arguments are 2 1 1 1 10.
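A quick sanity check of a data row against the chosen sizes can catch mistakes before training. The sketch below is a hypothetical helper (invalid_columns is not part of this repository). It encodes only the convention stated above: a size of 2 or more means a categorical feature with integer values from 0 to K - 1, a size of 0 or 1 means a real-valued feature, and NaN marks a missing value.

```python
import math


def invalid_columns(row, one_hot_max_sizes):
    """Return indices of columns whose value violates the declared size."""
    if len(row) != len(one_hot_max_sizes):
        raise ValueError("row length must match the number of sizes")
    bad = []
    for index, (value, size) in enumerate(zip(row, one_hot_max_sizes)):
        if math.isnan(value):
            continue  # missing features are always allowed
        if size >= 2:
            # Categorical: must be an integer in the range 0 .. size - 1.
            is_integer = float(value).is_integer()
            if not is_integer or not 0 <= value < size:
                bad.append(index)
    return bad


# Binary feature, three real-valued features, categorical with 10 classes.
sizes = [2, 1, 1, 1, 10]
good_row = [1.0, 0.3, -2.5, 7.0, 9.0]
bad_row = [2.0, 0.3, float("nan"), 7.0, 10.0]
good_result = invalid_columns(good_row, sizes)
bad_result = invalid_columns(bad_row, sizes)
```

Here bad_row fails in columns 0 and 4 because the values 2 and 10 fall outside the ranges 0..1 and 0..9.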
Validation ratio is the fraction of objects that are held out for validation and best model selection.
So the minimal working example of calling impute.py is
python impute.py --input_file input_data.tsv --output_file data_imputed.tsv \
--one_hot_max_sizes 2 1 1 1 10 --num_imputations 25 \
--epochs 1000 --validation_ratio 0.15
Validation IWAE samples is the number of latent samples drawn per object when computing the IWAE validation score.
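To make the role of the sample count concrete, here is a minimal sketch of how an IWAE-style estimate is commonly computed for one object from the log importance weights of its latent samples: the log of the mean weight, evaluated with the log-sum-exp trick for numerical stability. The function iwae_estimate is hypothetical and is not taken from this repository's code.

```python
import math


def iwae_estimate(log_weights):
    """Return log(mean(exp(log_weights))) computed in a stable way."""
    if not log_weights:
        raise ValueError("at least one latent sample is required")
    largest = max(log_weights)
    # Subtracting the maximum keeps exp() from underflowing to zero.
    total = sum(math.exp(w - largest) for w in log_weights)
    return largest + math.log(total) - math.log(len(log_weights))


# With a single sample the estimate is just that sample's log weight.
single = iwae_estimate([-2.5])
# Very negative log weights would underflow without the max shift.
stable = iwae_estimate([-1000.0, -1000.0])
```

More samples per object give a tighter and less noisy estimate at the cost of proportionally more computation during validation.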
The use last checkpoint flag forces impute.py
to use the state of the model
at the end of the training procedure for imputation.
By default, the best model according to IWAE validation score is used.
See python impute.py --help
for more options.
One can reproduce the paper results for the mushroom, yeast and white wine datasets with the following commands:
cd data
./fetch_data.sh
python prepare_data.py
mkdir -p imputations
python ../impute.py --input_file train_test_split/yeast_train.tsv \
--output_file imputations/yeast_imputed.tsv \
--one_hot_max_sizes 1 1 1 1 1 1 1 1 10 \
--num_imputations 10 --epochs 300 --validation_ratio 0.15
python ../impute.py --input_file train_test_split/mushroom_train.tsv \
--output_file imputations/mushroom_imputed.tsv \
--one_hot_max_sizes 6 4 10 2 9 2 2 2 12 2 4 4 4 9 9 4 3 5 9 6 7 2 \
--num_imputations 10 --epochs 50 --validation_ratio 0.15
python ../impute.py --input_file train_test_split/white_train.tsv \
--output_file imputations/white_imputed.tsv \
--one_hot_max_sizes 1 1 1 1 1 1 1 1 1 1 1 1 \
--num_imputations 10 --epochs 500 --validation_ratio 0.15
python evaluate_results.py yeast 1 1 1 1 1 1 1 1 10
python evaluate_results.py mushroom 6 4 10 2 9 2 2 2 12 2 4 4 4 9 9 4 3 5 9 6 7 2
python evaluate_results.py white 1 1 1 1 1 1 1 1 1 1 1 1
cd ..
Unlike missing feature imputation, image inpainting usually uses a dataset with no missing features together with a generator of unobserved region masks to learn to inpaint.
This repository contains all the code necessary to reproduce the CelebA inpaintings from the paper. It includes a CelebA dataset wrapper, all mask generators from the paper, and a model architecture. The code is written so that it is easy to use with new datasets, mask generators, model architectures, reconstruction losses, optimizers, etc.
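To give a feel for what a mask generator does, the sketch below draws a random rectangular mask for an image of a given size. It is purely illustrative: random_rectangle_mask and its parameters are hypothetical, it uses plain Python lists rather than PyTorch tensors, and the convention that 1 marks an unobserved pixel is an assumption. The generators shipped with this repository may use a different interface.

```python
import random


def random_rectangle_mask(height, width, rng, min_frac=0.3, max_frac=0.7):
    """Return a height x width grid with 1 inside a random rectangle, else 0."""
    rect_h = rng.randint(max(1, int(height * min_frac)),
                         max(1, int(height * max_frac)))
    rect_w = rng.randint(max(1, int(width * min_frac)),
                         max(1, int(width * max_frac)))
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    return [
        [
            # Assumption: 1 marks an unobserved pixel, 0 an observed one.
            1 if top <= i < top + rect_h and left <= j < left + rect_w else 0
            for j in range(width)
        ]
        for i in range(height)
    ]


# A seeded generator makes the example reproducible.
mask = random_rectangle_mask(8, 8, random.Random(0))
hidden_pixels = sum(sum(row) for row in mask)
```

During training, a fresh mask like this would be drawn for every image, so the model sees many different observed and unobserved regions.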
The image inpainting process is split into several stages:
- Firstly, one defines a model together with its optimizer, loss and
mask generator in a model.py file in a separate directory.
Such a model for the paper is provided in the celeba_model directory.
- Secondly, one implements image datasets (train, validation and test images
together with test masks) and adds them to datasets.py.
One can use the CelebA dataset, which is already implemented (but not downloaded!), and skip this step.
- Then one trains the model using
python train.py --model_dir celeba_model --epochs 40 \
--train_dataset celeba_train --validation_dataset celeba_val
See python train.py --help
for more options.
As a result, two files are created in the celeba_model directory:
last_checkpoint.tar and best_checkpoint.tar.
The second one is the best checkpoint according to IWAE on the validation set.
It is used for inpainting by default.
If these files are already in model_dir when train.py is started,
train.py uses last_checkpoint.tar as the initial state for training.
One can also download a pretrained model
from here,
put it into the celeba_model
directory and skip this step.
- After that, one can inpaint the test set by calling
python inpaint.py --model_dir celeba_model --num_samples 3 \
--masks celeba_inpainting_masks --dataset celeba_test \
--out_dir celeba_inpaintings
See python inpaint.py --help
for more options.
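The relationship between the two checkpoint files from the training stage can be sketched as follows. Given one validation IWAE score per epoch (higher is better, since IWAE is a lower bound on the log-likelihood), the last checkpoint corresponds to the final epoch and the best checkpoint to the epoch with the highest score. The function pick_checkpoints is hypothetical and only illustrates the selection rule, not how train.py is implemented.

```python
def pick_checkpoints(validation_iwae):
    """Return (last_epoch, best_epoch) as zero-based indices into the scores."""
    if not validation_iwae:
        raise ValueError("at least one epoch is required")
    last_epoch = len(validation_iwae) - 1
    # max() keeps the earliest epoch when several epochs tie for the best score.
    best_epoch = max(range(len(validation_iwae)),
                     key=validation_iwae.__getitem__)
    return last_epoch, best_epoch


# Validation IWAE improves, then degrades slightly in the final epoch.
scores = [-5.0, -3.2, -3.9]
last_epoch, best_epoch = pick_checkpoints(scores)
```

In this example the final epoch is index 2 while the best score occurs at index 1, which is the case where choosing the best checkpoint over the last one matters.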
If you find this code useful in your research, please consider citing the paper:
@inproceedings{
ivanov2018variational,
title={Variational Autoencoder with Arbitrary Conditioning},
author={Oleg Ivanov and Michael Figurnov and Dmitry Vetrov},
booktitle={International Conference on Learning Representations},
year={2019},
url={https://openreview.net/forum?id=SyxtJh0qYm},
}