A PyTorch implementation of the paper "Shot Retrieval and Assembly with Text Script for Video Montage Generation". The code is derived from the paper's supplementary material.
The original (paper's) VSPD dataset is available at https://github.com/RATVDemo/RATV
All commands below assume the RATV/ directory as the root. Install the dependencies first:
pip install -r requirements.txt
For VSPD, download the shots and caption files:
wget -O shots.tar "https://drive.google.com/uc?export=download&id=1wL3tgbDmzHL0arkBP_9SsLrOZQuiP1HL&confirm=t"
tar -xvf shots.tar
wget -O train_captions.jsonl "https://drive.google.com/uc?export=download&id=1vFW_z8WgMYgzw_evhFKs3QyucJ6DwKzT"
wget -O test_captions.jsonl "https://drive.google.com/uc?export=download&id=1s8TdVW6gGGlQYKBYDsyUqr7zkNHjVp94"
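To sanity-check the downloaded caption files, peek at the first record; the snippet below makes no assumptions about the field names:
import json

with open("train_captions.jsonl") as f:
    record = json.loads(f.readline())
print(sorted(record))  # show which fields each caption record carries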
For the TikTok dataset, assuming the bucket tiktok_montage is accessible from the instance:
gsutil -m cp -r "gs://tiktok_montage/scenes" .
gsutil -m cp "gs://tiktok_montage/tiktok_montage_gt_4w_9k.json" .
Before training, the shot videos must be embedded. To do this, run embed_videos.py.
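Purely as a hedged illustration of the general approach (uniformly sample frames per shot and encode them with BLIP-2; the "_f5" suffix in the embedding directories below presumably means five frames per shot, and the model checkpoint, file extension, and output layout here are assumptions, not what embed_videos.py actually does):
import glob
import os

import cv2
import torch
from transformers import Blip2Model, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b").to(device).eval()

def sample_frames(path, n=5):
    # Uniformly sample n RGB frames across the video.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in torch.linspace(0, max(total - 1, 0), n).long().tolist():
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

os.makedirs("Embeddings_blip2_f5", exist_ok=True)
for video in glob.glob("shots/*.mp4"):  # file extension is an assumption
    inputs = processor(images=sample_frames(video), return_tensors="pt").to(device)
    with torch.no_grad():
        feats = model.get_image_features(**inputs).pooler_output  # (n_frames, dim)
    torch.save(feats.cpu(), os.path.join("Embeddings_blip2_f5", os.path.basename(video) + ".pt"))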
Training is done with train.py. For instance:
export WANDB_API_KEY="..."
export ENCODER_TYPE="blip2"
python ~/RATV/train.py \
--data_dir ~/RATV/Embeddings_blip2_proj_f5/ \
--json_file ~/RATV/data/train_captions.jsonl \
--embed_dim 256 \
--hidden_size 256 \
--learning_rate 1e-4 \
--loss_weight 0.5 \
--batch_size 128 \
--wandb_name "RATV-Full" \
--epochs 200
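Checkpoints such as the transformer_pretrained_40.pt loaded in the generation step below are presumably written by train.py; a generic PyTorch way to inspect one (assuming it is saved as a plain state_dict, which may not match how train.py actually saves it):
import torch

state = torch.load("transformer_pretrained_40.pt", map_location="cpu")
if isinstance(state, dict):
    tensors = [v for v in state.values() if torch.is_tensor(v)]
    print(f"{len(tensors)} tensors, {sum(t.numel() for t in tensors) / 1e6:.1f}M params")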
Generation is done with generate.py. For instance:
export ENCODER_TYPE="blip2"
python ~/RATV/generate.py \
--data_dir ~/RATV/Embeddings_blip2_f5/ \
--data_dir_proj ~/RATV/Embeddings_blip2_proj_f5/ \
--json_file ~/RATV/data/test_captions.jsonl \
--transformer_path ~/RATV/checkpoints-glad-bee/transformer_pretrained_40.pt \
--threshold 0.5 \
--weight 0.75 \
--hidden_size 768 \
--sim_func default \
--allow_multi_embed \
--output_file ~/RATV/test_captions_generated.jsonl
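The --threshold and --weight flags suggest similarity-gated shot retrieval; as a generic illustration of thresholded cosine retrieval over precomputed embeddings (not generate.py's actual logic):
import torch
import torch.nn.functional as F

def retrieve(text_emb, shot_embs, threshold=0.5):
    # Cosine similarity between one text embedding (d,) and all shots (N, d).
    sims = F.cosine_similarity(text_emb.unsqueeze(0), shot_embs, dim=-1)
    keep = (sims > threshold).nonzero(as_tuple=True)[0]
    # Return the surviving shot indices, best match first.
    return keep[sims[keep].argsort(descending=True)]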
Evaluation is done with evaluate.py. For instance:
export ENCODER_TYPE="blip2"
python ~/RATV/evaluate.py \
--fea_root ~/RATV/Embeddings_blip2_proj_f5/ \
--target_file ~/RATV/data/test_captions.jsonl \
--generated_file ~/RATV/test_captions_generated.jsonl
Note: for consistency, we evaluate all outputs with BLIP2 embeddings.
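One common embedding-space score for such outputs is the mean cosine similarity between position-aligned generated and ground-truth shot embeddings, sketched here purely for illustration (evaluate.py's actual metrics may differ):
import torch.nn.functional as F

def mean_aligned_similarity(gen_embs, gt_embs):
    # Compare position-aligned shots up to the shorter sequence length.
    n = min(len(gen_embs), len(gt_embs))
    return F.cosine_similarity(gen_embs[:n], gt_embs[:n], dim=-1).mean().item()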