MMVideoTextRetrieval

MMVideoTextRetrieval is an open-source video-text retrieval toolbox based on PyTorch.

Introduction

This repository provides implementations of several video-text retrieval methods, together with dataset support and a retrieval demo.

Major Features

Modular design

We decompose the video-text retrieval framework into separate components that can easily be used in any combination.
Support for various datasets and features

The toolbox supports multiple datasets, such as MSRVTT, ActivityNet, and LSMDC. Various pre-extracted features are also provided.
Support for multiple video text retrieval frameworks

MMVideoTextRetrieval implements popular frameworks for video-text retrieval, such as MMT. More frameworks will be added later.
Visual demo

We provide a demo to visualize the results of video-text retrieval models.

Demo

We provide a way to run text-to-video retrieval in real-world applications. Before retrieval, the multi-modal features of the videos must be extracted and stored. The query text is defined in the "main_train" function in demo.py, and the "--sentence" option must be passed to activate the retrieval process. The output is the names of the video feature files for the 10 most similar videos.
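
The retrieval step described above can be sketched as a nearest-neighbour search over stored embeddings. This is a minimal illustration and not the code in demo.py: it assumes each video and the query sentence have already been reduced to a single embedding vector, and it assumes cosine similarity as the ranking score. The function name top_k_videos and the feature file names are hypothetical.

```python
import numpy as np

def top_k_videos(text_emb, video_embs, video_names, k=10):
    """Return (name, score) pairs for the k videos most similar to the text."""
    # L2-normalise so that the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb)
    video_embs = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = video_embs @ text_emb        # shape: (num_videos,)
    order = np.argsort(-scores)[:k]       # indices sorted by descending score
    return [(video_names[i], float(scores[i])) for i in order]

# Toy data: four stored feature files with 3-dimensional embeddings (hypothetical).
names = ["video0.npy", "video1.npy", "video2.npy", "video3.npy"]
videos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.1, 0.0])

results = top_k_videos(query, videos, names, k=2)
for name, score in results:
    print(f"{name}\t{score:.3f}")
```

With k=10 this returns a list of feature file names of the kind the demo is described as producing. A real model may combine several per-modality similarities rather than a single dot product.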

Benchmark

Model	Dataset	Video Feature	Text Feature	Pretrained	Text-to-Video R@1	Text-to-Video R@5	Text-to-Video R@10	Video-to-Text R@1	Video-to-Text R@5	Video-to-Text R@10
MMT	MSRVTT-1kA	S3D	Bert	no	24.6	54	67.1	24.4	56	67.8
MMT	ActivityNet	S3D	Bert	no	22.7	54.2	93.2	22.9	54.8	93.1
MMT	LSMDC	S3D	Bert	no	13.2	29.2	38.8	12.1	29.3	37.9
MMT	MSRVTT-1kA&B	S3D	Bert	HowTo100M	26.6	57.1	69.6	27	57.5	69.7
MMT	ActivityNet	S3D	Bert	HowTo100M	28.7	61.4	94.5	28.9	61.1	94.3
MMT	LSMDC	S3D	Bert	HowTo100M	12.9	29.9	40.1	12.3	28.6	38.9
HGR	MSRVTT-Full	Resnet152	Word2Vec	no	9.2	26.2	36.5	15	36.7	48.8

(All results are taken from the original papers and will be replaced later by the results of pre-trained models.)
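
For readers unfamiliar with the R@K columns, the following sketch shows one common way to compute recall at K from a similarity matrix. It is not taken from this repository: it assumes a square matrix in which text query i is paired with exactly one ground-truth video i, and the function name recall_at_k is hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j] is the similarity of query i and candidate j; the match is j == i."""
    # Rank of the correct candidate = number of candidates scored strictly higher.
    correct = np.diag(sim)[:, None]
    ranks = (sim > correct).sum(axis=1)
    # R@K is the percentage of queries whose correct candidate is in the top K.
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}

# Toy 3x3 example: rows are text queries, columns are videos.
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.3, 0.1],
                [0.2, 0.4, 0.6]])

t2v = recall_at_k(sim, ks=(1, 2))      # text-to-video: rank videos per query
v2t = recall_at_k(sim.T, ks=(1, 2))    # video-to-text: rank texts per video
print(t2v, v2t)
```

Transposing the matrix swaps the roles of query and candidate, which is how the same function can serve both retrieval directions reported in the table.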

Model Zoo

Supported methods for video-text retrieval:

MMT (ECCV'2020)
MMT-modified (ICMEW'2021)
HGR (CVPR'2020)

Dataset

Supported datasets:

MSR-VTT
ActivityNet Captions
- raw dataset
- multi-modal features
LSMDC
- raw dataset
- multi-modal features
TGIF
- raw dataset
- Resnet152 video features
VATEX
- raw dataset
- I3D video features

Get started

Requirements

Python 3.7
PyTorch 1.4.0+
Transformers 3.1.0
NumPy 1.18.1

pip install -r requirements.txt

Training

Training + evaluation:

python -m demo --config configs/$model_name/$dataset_$split_trainval.json

Evaluation from checkpoint:

python -m demo --config configs/$model_name/$dataset_$split_trainval.json --only_eval --load_checkpoint $checkpoint_path

Training from pretrained model:

python -m demo --config configs/$model_name/prtrn_$dataset_$split_trainval.json --load_checkpoint $checkpoint_path

Retrieve videos with a specific sentence:

python -m demo --config configs/$model_name/$dataset_$split_trainval.json --only_eval --load_checkpoint $checkpoint_path --sentence

Using the modified version of MMT for training:

python -m demo --config configs/$model_name/prtrn_$dataset_$split_trainval.json --modified_model
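
The commands above share one pattern: a config path built from the model name, dataset, and split, plus optional flags. The sketch below assembles such a command line as an illustration of that pattern. The helper build_command and the placeholder values my_model, my_dataset, my_split, and ckpt/model.pth are hypothetical; only the flags shown in the commands above are used.

```python
import shlex

def build_command(model_name, dataset, split, pretrained=False,
                  checkpoint=None, only_eval=False, modified=False):
    """Assemble a demo invocation following the config naming pattern above."""
    # Configs for training from a pretrained model carry the "prtrn_" prefix.
    prefix = "prtrn_" if pretrained else ""
    config = f"configs/{model_name}/{prefix}{dataset}_{split}_trainval.json"
    cmd = ["python", "-m", "demo", "--config", config]
    if only_eval:
        cmd.append("--only_eval")
    if checkpoint is not None:
        cmd += ["--load_checkpoint", checkpoint]
    if modified:
        cmd.append("--modified_model")
    return cmd

# Placeholder values; substitute names that exist under configs/.
cmd = build_command("my_model", "my_dataset", "my_split",
                    only_eval=True, checkpoint="ckpt/model.pth")
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list first and quoting it only for display avoids shell quoting problems if a checkpoint path contains spaces.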

shilrley6/MMVideoTextRetrieval