Yao-Chih Lee, Ji-Ze G. Jang, Yi-Ting Chen, Elizabeth Qiu, Jia-Bin Huang
- Tested on PyTorch 1.12.1 and CUDA 11.3
git clone --recursive https://github.com/text-video-edit/shape-aware-text-driven-layered-video-editing-release.git
pip install -r requirements.txt
- Prepare the super-resolution model (used only to improve the quality of the diffusion-edited images)
./scripts/setup_esrgan.sh
For an input video, the required data and directory structure are listed below. The NLA checkpoint and configuration file are produced by Layered Neural Atlases (NLA).
DATA_DIR/
├── images/
│   └── *.png or *.jpg
├── masks/
│   └── *.png
└── pretrained_nla_models/
    ├── checkpoint
    └── config.json
Each edit case is saved in EDIT_DIR, which is placed under DATA_DIR. We provide some examples in the data directory. For instance, DATA_DIR=data/car-turn and EDIT_DIR=data/car-turn/edit_sports_car.
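The layout above can be checked before running any step. The following is a minimal sketch that assumes only the directory and file names shown in the tree; the function name `check_data_dir` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def check_data_dir(data_dir):
    """Return a list of problems found in the DATA_DIR layout (empty if none)."""
    root = Path(data_dir)
    problems = []

    # images/ must hold the input frames as PNG or JPG files.
    images = root / "images"
    frames = []
    if images.is_dir():
        frames = [p for p in images.iterdir() if p.suffix.lower() in {".png", ".jpg"}]
    if not frames:
        problems.append("images/ has no *.png or *.jpg frames")

    # masks/ must hold PNG masks.
    masks = root / "masks"
    if not (masks.is_dir() and any(masks.glob("*.png"))):
        problems.append("masks/ has no *.png masks")

    # pretrained_nla_models/ must hold the NLA checkpoint and its config.
    nla = root / "pretrained_nla_models"
    for name in ("checkpoint", "config.json"):
        if not (nla / name).exists():
            problems.append(f"pretrained_nla_models/{name} is missing")

    return problems
```

Calling `check_data_dir("data/car-turn")` returns an empty list when every expected entry is present, and one message per missing entry otherwise.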
- Generate NLA outputs from the pretrained NLA model
python scripts/generate_nla_outputs.py [DATA_DIR]
- Edit the foreground
python scripts/edit_foreground [DATA_DIR] [TEXT_PROMPT]
  - Put your HuggingFace token in a file named TOKEN in the root directory.
  - This step creates EDIT_DIR under DATA_DIR and saves all of the keyframe data in EDIT_DIR. Note that we sometimes refine the mask of the edited keyframe manually, since Mask R-CNN may fail to find precise masks for diffusion-generated images.
- Semantic correspondence
To achieve the best editing results, we use the warping tools in Photoshop to obtain the semantic correspondence between EDIT_DIR/keyframe_input_crop.png and EDIT_DIR/keyframe_edited_crop.png, and save it as EDIT_DIR/semantic_correspondence_crop.npy. The correspondence format is similar to optical flow, with values in the range [-1, 1].
- Optimization
python main.py [EDIT_DIR]
The training results will be saved in EDIT_DIR/workspace.
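The steps above split into two stages around the manual semantic-correspondence step, which may be easier to see in code. The sketch below only assembles the commands as they are written in this list; the helper names (`missing_token`, `commands_before_correspondence`, `optimization_command`) are hypothetical and are not part of this repository, and the script paths are copied verbatim from the steps.

```python
from pathlib import Path


def missing_token(repo_root="."):
    """True if the HuggingFace token file named TOKEN is absent from the root."""
    return not (Path(repo_root) / "TOKEN").is_file()


def commands_before_correspondence(data_dir, text_prompt):
    """Commands to run before the manual semantic-correspondence step."""
    return [
        ["python", "scripts/generate_nla_outputs.py", str(data_dir)],
        # Script path copied verbatim from the step list.
        ["python", "scripts/edit_foreground", str(data_dir), text_prompt],
    ]


def optimization_command(edit_dir):
    """Command for the final step; results go to EDIT_DIR/workspace."""
    return ["python", "main.py", str(edit_dir)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Since EDIT_DIR is created by the foreground-editing step and the correspondence file is produced by hand, `optimization_command` takes EDIT_DIR explicitly, for example `data/car-turn/edit_sports_car`.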
We thank the authors for releasing Layered Neural Atlases, Text2LIVE, Stable-DreamFusion, and Real-ESRGAN.