This repository contains a PyTorch implementation of DeepMind's MONet model for unsupervised scene decomposition. The model was introduced in the paper "MONet: Unsupervised Scene Decomposition and Representation" by Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner.
Similarly to previous models such as AIR, MONet learns to decompose scenes into objects and background without supervision. Unlike AIR, however, it learns attention masks that yield true segmentations rather than just bounding boxes. Object and background appearances are modelled by a component VAE.
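For orientation, the following is a minimal sketch of the recursive decomposition described in the paper: at each step an attention network carves a mask out of the remaining "scope" of the image, and a component VAE reconstructs the image within that mask. The names `decompose`, `attention_net`, and `component_vae` are illustrative placeholders, not this repository's actual API.

```python
import torch

def decompose(image, attention_net, component_vae, num_slots=5):
    """Sketch of MONet's recursive decomposition. `attention_net` and
    `component_vae` stand in for the modules defined in model.py, whose
    actual names and signatures may differ."""
    # log s_0 = 0: initially the whole image remains to be explained
    log_scope = torch.zeros_like(image[:, :1])
    log_masks, recons = [], []
    for k in range(num_slots):
        if k < num_slots - 1:
            # the attention net splits the current scope into a mask for
            # slot k (scope * alpha_k) and a new scope (scope * (1 - alpha_k))
            log_alpha = attention_net(image, log_scope)
            log_masks.append(log_scope + log_alpha)
            log_scope = log_scope + torch.log1p(
                -log_alpha.exp().clamp(max=1 - 1e-6))
        else:
            # the final slot absorbs whatever scope is left
            log_masks.append(log_scope)
        # the component VAE reconstructs the image within each mask
        recons.append(component_vae(image, log_masks[-1]))
    return log_masks, recons
```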
The following image shows a sample of results on a homemade version of the sprite dataset. The first row of images depicts the input, the second the inferred segmentation, and the third the reconstruction.
The attention network successfully learns to segment the images. One issue appears to be that distinct objects of the same color tend not to be separated. Since the model structure does not force objects to be spatially coherent, this is perhaps to be expected.
We ran our experiments using Python 3.6 and CUDA 9.0, making use of the following Python packages:
- torch 1.0
- numpy
- visdom
These may be installed via `pip install -r requirements.txt`. Other versions might also work but were not tested.
- `model.py` contains the model, implemented as a set of PyTorch modules
- `main.py` contains the training loop
- `config.py` contains adjustable parameters, including directories and hyperparameters
- `datasets.py` contains routines for loading the data
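A forward pass might then look roughly like the sketch below; the names `Monet` and `sprite_config` are assumptions for illustration and may not match the actual definitions in model.py and config.py.

```python
import torch
from model import Monet            # assumed class name; see model.py
from config import sprite_config   # assumed config object; see config.py

model = Monet(sprite_config)
images = torch.rand(8, 3, 64, 64)  # dummy batch of 64x64 RGB images
output = model(images)             # e.g. loss, masks, and reconstructions
```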
Simply run `python main.py`. Adjust the configuration object created at the bottom of the file as needed, or use one of the provided configurations to reproduce the results above. Note that the experiments on CLEVR were run on a V100 GPU with 32GB of memory, so you may need to reduce the model size in order to fit it on a smaller GPU.
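For instance, a reduced configuration for a smaller GPU might look something like the sketch below; the field names are illustrative and not necessarily the attributes defined in config.py.

```python
# Illustrative only: check config.py for the actual parameter names.
small_conf = dict(
    num_slots=5,        # fewer attention steps than the CLEVR setup
    channel_base=32,    # narrower convolutions in the attention network
    latent_dim=16,      # smaller component VAE latent space
    batch_size=32,      # smaller batches to fit in GPU memory
)
```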