Make It Move: Controllable Image-to-Video Generation with Text Descriptions

Primary language: Python. License: Apache License 2.0 (Apache-2.0).


This repository contains the datasets and source code used in the CVPR 2022 paper "Make It Move: Controllable Image-to-Video Generation with Text Descriptions".


  • We improved MAGE with a more powerful autoencoder and a controller over the VAE. The code and models of the improved version, MAGE+, have been released on Google Drive.
  • We proposed two no-reference evaluation metrics, action precision and referring expression precision, which evaluate the precision of fine-grained motions using a captioning-and-matching method. We chose SwinBERT as the captioning model; please download the model trained on CATER-GENs from Google Drive and put it under 'metrics/swinbert_cater'. The evaluation can then be run as follows:
$ docker run --gpus all --ipc=host --rm -it --mount src=/home/user/SwinBERT/,dst=/videocap,type=bind --mount src=/home/user/,dst=/home/user/,type=bind -w /videocap linjieli222/videocap_torch1.7:fairscale bash -c "source /videocap/setup.sh && bash"
$ python metrics/swinbert_cater/eval_precision_run_caption_VidSwinBert.py --do_lower_case --do_test --eval_model_dir ./metrics/swinbert_cater/ --test_video_fname /home/results/
$ python eval_precision.py --data-root /home/user/datasets/CATER-GEN-v1 --gen-caption /home/user/results/catergenv1_diverse/generated_captions.json --mode ambiguous
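
The captioning-and-matching idea behind action precision can be illustrated with a minimal sketch. This is not the logic of eval_precision.py: the caption format, the action vocabulary, and the matching rule (a generated caption counts as correct when it names exactly the same actions as the reference description) are all assumptions made for illustration.

```python
# Hypothetical action vocabulary; the real evaluation script may use a different one.
ACTIONS = ("rotating", "sliding", "picked", "containing")


def extract_actions(caption):
    """Return the set of vocabulary actions mentioned in a caption."""
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return {w for w in words if w in ACTIONS}


def action_precision(generated, reference):
    """Fraction of video ids whose generated caption names the same
    actions as the reference description (exact set match)."""
    ids = [vid for vid in reference if vid in generated]
    if not ids:
        return 0.0
    hits = sum(
        extract_actions(generated[vid]) == extract_actions(reference[vid])
        for vid in ids
    )
    return hits / len(ids)


# Toy data: captions produced for generated videos vs. the input descriptions.
reference = {
    "v0": "the cone is sliding to the corner",
    "v1": "the cube is rotating",
}
generated = {
    "v0": "the small cone is sliding to the left corner",
    "v1": "the cube is sliding",
}
score = action_precision(generated, reference)
print(score)
```

In this toy example only the first video is captioned with the intended action, so half of the videos match. Because the score relies only on captions of the generated videos and the input text, no ground-truth video is needed.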

Dataset Generation

Moving MNIST datasets

The scripts for generating the Moving MNIST datasets are adapted from Sync-DRAW. Run the following commands to generate Single Moving MNIST, Double Moving MNIST, and our Modified Double Moving MNIST, respectively.

$ python data/mnist_caption_single.py
$ python data/mnist_caption_double.py
$ python data/mnist_caption_double_modified.py
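
For readers unfamiliar with Moving MNIST, the sketch below shows the general idea of pasting a digit patch along a bouncing trajectory to form a clip. It is not taken from the scripts above; the function names, the 2x2 stand-in digit, the canvas size, and the velocity are hypothetical, and the actual scripts may define motion and captions differently.

```python
def bounce_positions(start, velocity, patch, canvas, num_frames):
    """Top-left corners of a square patch bouncing inside a square canvas.

    Assumes the start position is valid and each velocity component is
    no larger than canvas - patch.
    """
    x, y = start
    dx, dy = velocity
    limit = canvas - patch  # largest valid top-left coordinate
    positions = []
    for _ in range(num_frames):
        positions.append((x, y))
        # Reverse direction on an axis when the next step would leave the canvas.
        if not 0 <= x + dx <= limit:
            dx = -dx
        if not 0 <= y + dy <= limit:
            dy = -dy
        x, y = x + dx, y + dy
    return positions


def render(patch_pixels, positions, canvas):
    """Paste the patch at each position to build a list of frames."""
    frames = []
    for x, y in positions:
        frame = [[0] * canvas for _ in range(canvas)]
        for r, row in enumerate(patch_pixels):
            for c, value in enumerate(row):
                frame[y + r][x + c] = value
        frames.append(frame)
    return frames


# A 2x2 stand-in for a 28x28 MNIST digit, moving on an 8x8 canvas.
digit = [[1, 1], [1, 1]]
positions = bounce_positions(start=(4, 0), velocity=(2, 1), patch=2, canvas=8, num_frames=5)
frames = render(digit, positions, canvas=8)
print(positions)
```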


Datasets Download

The original CATER-GEN-v1 and CATER-GEN-v2 used in our paper are provided at link1 and link2, respectively.

Create Your Own Datasets

Thanks to the authors of CATER and CLEVR for making their code available. Using it, you can also generate your own datasets as follows.

First, generate the videos and metadata following the CATER guidelines. Set the hyper-parameters min_objects, max_objects, num_frames, num_images, width, and height to suit your dataset, and fix CAM_MOTION = False and start_frame = 0. Then generate the text descriptions by running:

$ python data/gen_cater_text_anno.py


Training our proposed baseline, MAGE, has two stages. The first stage trains a VQ-VAE encoder and decoder. The second stage trains the remaining video generation model. The trained models are provided on Google Drive.


Our code has been tested on Ubuntu 18.04. Before starting, please configure your Anaconda environment as follows:

$ conda create -n mage python=3.8
$ conda activate mage
$ pip install -r requirements.txt

Stage 1. VQ-VAE Training

$ python train_vqvae.py --dataset mnist --data-root /data/data_file --output-folder ./models/vqvae_model_file
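
As background on what the first stage learns, the sketch below illustrates the generic vector-quantization step of a VQ-VAE: each encoder output vector is replaced by its nearest codebook entry, and the resulting indices serve as discrete tokens. This is a plain-Python illustration under assumed toy values, not the implementation in train_vqvae.py; the names quantize, dequantize, codebook, and latents are hypothetical.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in vector quantization."""
    indices = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices, codebook):
    """Replace each index with its codebook vector (what a decoder would receive)."""
    return [codebook[i] for i in indices]


# Toy 2-D codebook and encoder outputs; real models use learned, higher-dimensional ones.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (0.2, 1.8)]
tokens = quantize(latents, codebook)
quantized = dequantize(tokens, codebook)
print(tokens)
```

A second-stage model can then operate on the discrete tokens rather than on raw pixels.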

Stage 2. MAGE Training

$ python main_mage.py --split train --config config/model.yaml --checkpoint-path ./models/MAGE/model_path


$ python main_mage.py --split test --config config/model.yaml --checkpoint-path ./models/MAGE/model_path
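
If you prefer to drive both runs from a script, the following minimal sketch builds the two command lines using only the flags shown above. The helper name mage_command is hypothetical, and the default paths are the placeholders from the commands above; replace them with your own.

```python
import shlex


def mage_command(split, config="config/model.yaml", checkpoint="./models/MAGE/model_path"):
    """Build the argv list for main_mage.py using only the flags shown above."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    return [
        "python", "main_mage.py",
        "--split", split,
        "--config", config,
        "--checkpoint-path", checkpoint,
    ]


train_cmd = mage_command("train")
test_cmd = mage_command("test")
print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
# To launch for real: import subprocess; subprocess.run(train_cmd, check=True)
```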


If you find this repository useful in your research, please cite:

    title={Make It Move: Controllable Image-to-Video Generation with Text Descriptions},
    author={Yaosi Hu and Chong Luo and Zhenzhong Chen},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},