Learn to Understand Negation in Video Retrieval

This is the official source code of our paper: Learn to Understand Negation in Video Retrieval.

Requirements

We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following commands to install all the required packages.

conda create -n py37 python==3.7 -y
conda activate py37
git clone git@github.com:ruc-aimc-lab/nT2VR.git
cd nT2VR
pip install -r requirements.txt

Prepare Data

Download official video data

  • For MSR-VTT, the official data can be found in link. The raw videos can be found in the sharing from Frozen in Time.

    We follow the official MSR-VTT3k split and the MSR-VTT1k split (described in the JSFusion paper).

  • For VATEX, the official data can be found in this link.

    We follow the split of HGR.

  • We extract frames from each video at an interval of 0.5 seconds before training, using the scripts from video-cnn-feat. Each data folder should also contain a file that maps frame IDs to image paths (see the example id.imagepath.txt; the prefix of each frame ID should be consistent with the video ID). A minimal sketch of this step is given after this list.
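
As a rough guide, the sketch below samples one frame every 0.5 seconds with OpenCV and records one "<frame_id> <image_path>" line per frame. The frame-ID scheme (video ID plus a running index) is an assumption to be checked against the id.imagepath.txt example, and extract_frames is a hypothetical helper, not part of the repo.

import os
import cv2

def extract_frames(video_path, out_dir, interval=0.5):
    # frame IDs are prefixed with the video ID, e.g. video1_0, video1_1, ...
    # (assumed convention; confirm against id.imagepath.txt)
    video_id = os.path.splitext(os.path.basename(video_path))[0]
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unreadable
    step = max(int(round(fps * interval)), 1)
    lines, saved, frame_no = [], 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_no % step == 0:
            frame_id = "%s_%d" % (video_id, saved)
            image_path = os.path.join(out_dir, frame_id + ".jpg")
            cv2.imwrite(image_path, frame)
            lines.append(frame_id + " " + image_path)
            saved += 1
        frame_no += 1
    cap.release()
    return lines  # one "<frame_id> <image_path>" line per saved frame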

Download text data for Training & Evaluation in nT2V

Download the data for training and evaluation in nT2V. We use the prefixes "msrvtt10k" and "msrvtt1kA" to distinguish the MSR-VTT3k split from the MSR-VTT1k split.

  • The training data augmented by the negator is named "**.caption.neagtion.txt". The negated and composed test query sets are named "**.negated.txt" and "**.composed.txt", respectively. A sketch of how such files might be read is shown below.
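
A minimal reader for these files might look as follows. The line format assumed here (a query/caption ID, whitespace, then the sentence) is an assumption based on common caption-file layouts, and load_queries is a hypothetical helper; inspect the downloaded files before relying on it.

def load_queries(path):
    # maps query ID -> sentence, assuming "<id> <sentence>" lines
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            query_id, _, sentence = line.partition(" ")
            queries[query_id] = sentence
    return queries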

Evaluation on test queries of nT2V

We provide scripts for evaluating zero-shot CLIP, CLIP*, and CLIP-bnl on nT2V.

  • CLIP: the original model, used in a zero-shot setting.
  • CLIP*: CLIP fine-tuned on text-to-video retrieval data with a retrieval loss.
  • CLIP-bnl: CLIP fine-tuned with the proposed negation learning.

Below are the checkpoints and performances of CLIP, CLIP*, and CLIP-bnl. In the tables, R1/R5/R10 denote Recall@K (%) and MIR the mean inverted rank; the Δ columns are measured on the negated queries (a sketch of these metrics follows the tables).

MSR-VTT3k

Model            Checkpoint  | Original                | Negated                 | Composed
                             | R1    R5    R10   MIR   | ΔR1   ΔR5   ΔR10  ΔMIR  | R1    R5    R10   MIR
CLIP                         | 20.8  40.3  49.7  0.305 | 1.5   2.5   2.9   0.020 | 6.9   24.2  35.6  0.160
CLIP*                        | 27.7  53.0  64.2  0.398 | 0.5   1.1   1.1   0.008 | 11.4  33.3  46.2  0.225
CLIP (boolean)               | --    --    --    --    | 18.8  37.5  46.2  0.118 | 5.9   16.7  23.9  0.116
CLIP* (boolean)              | --    --    --    --    | 25.3  47.1  56.1  0.236 | 13.5  33.7  45.5  0.243
CLIP-bnl                     | 28.4  53.7  64.6  0.404 | 5.0   6.9   6.9   0.057 | 15.3  40.0  53.3  0.274

MSR-VTT1k

Model            Checkpoint  | Original                | Negated                 | Composed
                             | R1    R5    R10   MIR   | ΔR1   ΔR5   ΔR10  ΔMIR  | R1    R5    R10   MIR
CLIP                         | 31.6  54.2  64.2  0.422 | 1.4   1.4   1.5   0.017 | 12.9  35.0  46.2  0.237
CLIP*                        | 41.1  69.8  79.9  0.543 | 0.0   1.7   1.0   0.006 | 17.3  46.8  61.2  0.310
CLIP (boolean)               | --    --    --    --    | 26.4  46.2  56.8  0.354 | 6.3   18.4  25.9  0.129
CLIP* (boolean)              | --    --    --    --    | 35.9  59.5  65.2  0.463 | 17.6  42.0  52.0  0.291
CLIP-bnl                     | 42.1  68.4  79.6  0.546 | 12.2  11.7  14.4  0.121 | 24.8  57.6  68.8  0.391

VATEX

Model            Checkpoint  | Original                | Negated                 | Composed
                             | R1    R5    R10   MIR   | ΔR1   ΔR5   ΔR10  ΔMIR  | R1    R5    R10   MIR
CLIP                         | 41.4  72.9  82.7  0.555 | 1.9   2.1   2.2   0.018 | 10.5  28.3  41.3  0.201
CLIP*                        | 56.8  88.4  94.4  0.703 | 0.2   0.4   0.7   0.004 | 14.2  39.2  53.3  0.266
CLIP (boolean)               | --    --    --    --    | 32.5  57.2  64.5  0.431 | 5.0   18.0  25.6  0.116
CLIP* (boolean)              | --    --    --    --    | 25.3  47.1  56.1  0.353 | 14.1  34.4  45.1  0.243
CLIP-bnl                     | 57.6  88.3  94.0  0.708 | 14.0  11.7  8.6   0.125 | 16.6  39.9  53.9  0.284
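
The metrics above are standard retrieval measures. Below is a minimal sketch of how they can be computed, assuming a numpy array sim of shape (num_queries, num_videos) where sim[i, j] scores query i against video j, and gt[i] is the index of query i's ground-truth video; these variable names are illustrative, not taken from the repo's evaluation code.

import numpy as np

def gt_ranks(sim, gt):
    # 0-based rank of the ground-truth video in each query's descending score list
    order = np.argsort(-sim, axis=1)
    return np.array([np.where(order[i] == gt[i])[0][0] for i in range(len(gt))])

def recall_at_k(ranks, k):
    return 100.0 * float(np.mean(ranks < k))   # R@K in percent

def mean_inverted_rank(ranks):
    return float(np.mean(1.0 / (ranks + 1)))   # MIR in (0, 1]

# Example: the R1/R5/R10 and MIR values reported above
# ranks = gt_ranks(sim, gt)
# r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
# mir = mean_inverted_rank(ranks)
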
  • To evaluate zero-shot CLIP, run the script clip.sh:
# use 'rootpath' to specify the path to the data folder
cd shell/test
bash clip.sh
  • To evaluate CLIP*, run the script clipft.sh:
# use 'rootpath' to specify the path to the data folder
# use 'model_path' to specify the path to the model
cd shell/test
bash clipft.sh
  • To evaluate zero-shot CLIP+boolean, run the script clip_bool.sh:
cd shell/test
bash clip_bool.sh
  • To evaluate CLIP*+boolean, run the script clipft_bool.sh:
cd shell/test
bash clipft_bool.sh
  • To evaluate CLIP-bnl, run the script clip_bnl.sh:
cd shell/test
bash clip_bnl.sh

Train CLIP-bnl from scratch

  • To train CLIP-bnl on the MSR-VTT3k split, run:
# use 'rootpath' to specify the path to the data folder
cd shell/train
bash msrvtt7k_clip_bnl.sh
  • To train CLIP-bnl on the MSR-VTT1k split, run:
cd shell/train
bash msrvtt9k_clip_bnl.sh
  • To train CLIP-bnl on VATEX, run:
cd shell/train
bash vatex_clip_bnl.sh
  • Additionally, the training script for CLIP* is clipft.sh.
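
The bnl objective itself is defined in the training code rather than in this README. Purely to illustrate the idea behind negation learning (a video should match its negated caption less than its original caption), here is a hypothetical margin term; negation_margin, sim_pos, sim_neg, and the margin value are illustrative names and are NOT the repo's actual loss.

import torch
import torch.nn.functional as F

def negation_margin(sim_pos, sim_neg, margin=0.2):
    # sim_pos: similarities between videos and their original captions, shape (B,)
    # sim_neg: similarities between the same videos and negated captions, shape (B,)
    # hinge: only penalize pairs whose negated caption scores too close to the video
    return F.relu(margin - (sim_pos - sim_neg)).mean()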

Produce new negated & composed data

  1. Install additional packages:
cd negationdata
pip install -r requirements.txt
  2. Download the checkpoint of the negation scope detection model, which is built on NegBERT.
  3. Run the script prepare_data.sh:
# use 'rootpath' to specify the path to the data folder
# use 'cache_dir' to specify the path to the models used by the negation scope detection model
cd negationdata
bash prepare_data.sh
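
The actual negator relies on the NegBERT-based scope detector downloaded in step 2; the toy rule below only illustrates the kind of transformation the pipeline produces from an affirmative caption, and naive_negate is a hypothetical function, not the repo's implementation.

def naive_negate(caption):
    # insert "not" after the first auxiliary verb, if any
    auxiliaries = {"is", "are", "was", "were"}
    words = caption.split()
    for i, word in enumerate(words):
        if word in auxiliaries:
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    return caption  # captions without a handled auxiliary are left unchanged

print(naive_negate("a man is riding a bike"))  # -> "a man is not riding a bike"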

Citation

@inproceedings{mm22-nt2vr,
  title     = {Learn to Understand Negation in Video Retrieval},
  author    = {Ziyue Wang and Aozhu Chen and Fan Hu and Xirong Li},
  year      = {2022},
  booktitle = {ACMMM},
}

Contact

If you encounter any issues when running the code, please feel free to reach out to us.