Shape-aware Text-driven Layered Video Editing [CVPR 2023]

Yao-Chih Lee, Ji-Ze G. Jang, Yi-Ting Chen, Elizabeth Qiu, Jia-Bin Huang

[Webpage] [Paper]

Environment

  • Tested with PyTorch 1.12.1 and CUDA 11.3
git clone --recursive https://github.com/text-video-edit/shape-aware-text-driven-layered-video-editing-release.git 
pip install -r requirements.txt
  • Prepare the super-resolution model (used only to improve the quality of the diffusion-edited images)
./scripts/setup_esrgan.sh

Data structure

For an input video, the required data and directory structure are listed below. The NLA checkpoint and configuration files are obtained from Layered Neural Atlases.

DATA_DIR/
├── images/
│   └── *.png or *.jpg
├── masks/
│   └── *.png
└── pretrained_nla_models/
    ├── checkpoint
    └── config.json

Each edit case is saved in an EDIT_DIR, which is placed under the DATA_DIR. We provide some examples in the data directory.

For instance, DATA_DIR=data/car-turn and EDIT_DIR=data/car-turn/edit_sports_car.
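
To double-check that a DATA_DIR follows this layout before running the scripts, a minimal sketch is shown below. This helper is not part of the repository; it only checks for the files listed in the tree above.

    import sys
    from pathlib import Path

    def check_data_dir(data_dir):
        """Verify that DATA_DIR contains the files expected by the pipeline."""
        data_dir = Path(data_dir)
        frames = list(data_dir.glob("images/*.png")) + list(data_dir.glob("images/*.jpg"))
        masks = list(data_dir.glob("masks/*.png"))
        ok = True
        if not frames:
            print("missing frames: expected images/*.png or images/*.jpg")
            ok = False
        if not masks:
            print("missing masks: expected masks/*.png")
            ok = False
        for f in ["pretrained_nla_models/checkpoint", "pretrained_nla_models/config.json"]:
            if not (data_dir / f).exists():
                print(f"missing NLA file: {f}")
                ok = False
        return ok

    if __name__ == "__main__":
        check_data_dir(sys.argv[1])  # e.g. python check_data.py data/car-turn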

Running

  • Generate NLA outputs from the pretrained NLA model

    python scripts/generate_nla_outputs.py [DATA_DIR]
    
  • Edit Foreground

      python scripts/edit_foreground.py [DATA_DIR] [TEXT_PROMPT]
    
    • Please put your HuggingFace token in a file named TOKEN in the root directory.

    • It will create EDIT_DIR under DATA_DIR, and all of the keyframe data will be saved in EDIT_DIR. Note that you may sometimes need to manually refine the mask of the edited keyframe, since Mask R-CNN can fail to produce precise masks for diffusion-generated images. A sketch for obtaining an initial mask is given below.
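
      A minimal sketch of how an initial mask could be obtained with torchvision's pretrained Mask R-CNN is shown below; the file names and the 0.5 threshold are illustrative assumptions, not the exact procedure used by the scripts.

      import torch
      import numpy as np
      from PIL import Image
      from torchvision.models.detection import maskrcnn_resnet50_fpn
      from torchvision.transforms.functional import to_tensor

      # Load the diffusion-edited keyframe (path is an assumption for illustration).
      img = Image.open("EDIT_DIR/keyframe_edited_crop.png").convert("RGB")

      model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
      with torch.no_grad():
          out = model([to_tensor(img)])[0]

      # Take the highest-scoring instance and binarize its soft mask.
      mask = (out["masks"][0, 0] > 0.5).numpy().astype(np.uint8) * 255
      Image.fromarray(mask).save("EDIT_DIR/keyframe_edited_mask_init.png")
      # Refine this mask manually (e.g., in an image editor) if it is imprecise.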

    • Semantic correspondence

      To achieve the best editing results, we use the warping tools in Photoshop to obtain the semantic correspondence between EDIT_DIR/keyframe_input_crop.png and EDIT_DIR/keyframe_edited_crop.png, and save it as EDIT_DIR/semantic_correspondence_crop.npy. The correspondence format is similar to optical flow, with values normalized to [-1, 1] (see the sketch below).
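
      As a sanity check of a saved correspondence, a minimal sketch that warps the input crop with torch.nn.functional.grid_sample is shown below. It assumes the .npy stores, for each pixel of the edited crop, absolute (x, y) sampling coordinates into the input crop in grid_sample's normalized [-1, 1] convention with shape (H, W, 2); the exact convention used by the repo may differ, and the output file name is illustrative.

      import numpy as np
      import torch
      import torch.nn.functional as F
      from PIL import Image

      # Assumed format for illustration: corr is (H, W, 2) in [-1, 1], giving for
      # each output pixel the (x, y) location to sample in the input crop.
      corr = np.load("EDIT_DIR/semantic_correspondence_crop.npy")
      src = np.array(Image.open("EDIT_DIR/keyframe_input_crop.png").convert("RGB"))

      src_t = torch.from_numpy(src).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # (1, 3, H, W)
      grid = torch.from_numpy(corr).float().unsqueeze(0)                           # (1, H, W, 2)

      warped = F.grid_sample(src_t, grid, mode="bilinear", align_corners=True)
      out = (warped[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)
      Image.fromarray(out).save("EDIT_DIR/warp_check.png")  # compare against keyframe_edited_crop.png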

  • Optimization

    python main.py [EDIT_DIR]
    

    The training results will be saved in EDIT_DIR/workspace.

Acknowledgements

We thank the authors of Layered Neural Atlases, Text2LIVE, Stable-DreamFusion, and Real-ESRGAN for releasing their code.