zsarcap

Official code for Tell Me What You See: A Zero-Shot Action Recognition Method Based on Natural Language Descriptions (Multimedia Tools and Applications 2024)


Tell me what you see: A zero-shot action recognition method based on natural language descriptions

Summary

This repository contains the implementation of the paper "Tell me what you see: A zero-shot action recognition method based on natural language descriptions", published in Multimedia Tools and Applications.

As shown in Figure 1, several captioning models (called observers) watch the videos and provide a sentence description. These descriptions are projected onto a semantic space shared with textual descriptions for each action class.


Figure 1. The semantic representation of our ZSAR method. In (a), we show the visual representation procedure: a video is watched by several video captioning systems, called Observers, which produce video descriptions. In (b), the semantic representation is shown: using a search engine on the Internet, we collect documents containing textual descriptions for the classes; in this example, the documents for the Balance Beam action are preprocessed to select the ten sentences most similar to the class name. Finally, in (c), a joint embedding space constructed with a BERT-based paraphrase embedder is used to project both representations into a highly structured semantic space.
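For readers who prefer code to diagrams, the fragment below is a minimal sketch of the Figure 1 pipeline, not the repository's implementation. It assumes the sentence-transformers and scikit-learn packages; the helper names (select_class_sentences, classify) are illustrative, and representing a video or a class by the mean of its sentence embeddings is our simplification. The values mirror the options used in Step 7 (--zsar_embedder_name paraphrase-distilroberta-base-v2, --max_sentences_per_class 10, --min_words_per_sentence_description 15, --k_neighbors 1, --metric cosine); see run_experiment.py for the exact procedure.

# Minimal sketch of the Figure 1 pipeline (illustrative only, not the repository code).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("paraphrase-distilroberta-base-v2")

def select_class_sentences(class_name, document_sentences, max_sentences=10, min_words=15):
    # Figure 1 (b): keep the sentences from the collected documents that are
    # most similar to the class name.
    candidates = [s for s in document_sentences if len(s.split()) >= min_words]
    sims = cosine_similarity(embedder.encode([class_name]), embedder.encode(candidates))[0]
    top = np.argsort(sims)[::-1][:max_sentences]
    return [candidates[i] for i in top]

def classify(observer_descriptions, class_sentences):
    # Figure 1 (a) + (c): embed the Observers' descriptions of one video and
    # assign it to the nearest class prototype under cosine similarity (1-NN).
    video = embedder.encode(observer_descriptions).mean(axis=0, keepdims=True)
    names = list(class_sentences)
    prototypes = np.stack([embedder.encode(s).mean(axis=0) for s in class_sentences.values()])
    return names[int(np.argmax(cosine_similarity(video, prototypes)[0]))]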

We reach state-of-the-art performance on UCF101 and competitive performance on HMDB51 in a scenario in which none of the datasets' classes are used to train the ZSAR method. We report results under the TruZe protocol (GOWDA et al., 2021) and under a relaxed ZSAR constraint (for details on this relaxed constraint, see Section 4.5.6 of the paper).


The code was tested on Ubuntu 20.10 with an NVIDIA Titan Xp GPU. Other software/hardware may require adapting the conda environment files.

Observers

Our Observers come from two different architectures, shown in Figure 2: the bi-modal transformer (BMT) (IASHIN and RAHTU, 2020) and the transformer from the MDVC implementation (IASHIN and RAHTU, 2020), fed with the different feature schemes shown in Figure 3.

Figure 2. Overview of the captioning architectures showing the Bi-Modal Transformer and Transformer layers with their inputs and the language generation module. Adapted from Estevam et al. (2021).

If you are interested in retraining or reproducing our observer results, see the DVCUSI repository for instructions on how to obtain and run the captioning models. It also contains instructions for computing your own Visual GloVe features.

Information

All the features used are available for download in the Downloads section.

We converted the UCF101 and HMDB51 data into the format expected by the BMT and MDVC methods. The files data/ucf101/ucf101_bmt_format.txt, data/ucf101/ucf101_mdvc_format.txt, data/hmdb51/hmdb51_bmt_format.txt, and data/hmdb51/hmdb51_mdvc_format.txt must replace the ActivityNet Captions validation files when using the observer models to produce descriptive sentences.

Reproducing our results

Step 1: clone this repository

git clone git@github.com:valterlej/zsarcap.git

Step 2: create the conda environment

conda env create -f ./conda_env.yml
conda activate zsar

Step 3: install the spacy module

python -m spacy download en

Step 4: install sent2vec module

cd model/sent2vec
make
pip install .

For more details, see https://github.com/epfml/sent2vec#setup-and-requirements.

Step 5: download the wikibigrams.bin file

Download wikibigrams.bin and save it in the data/ directory (~17.2 GB).
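A quick way to check that sent2vec and the downloaded model work together is the snippet below. It is a generic sent2vec usage example (the test sentence is arbitrary), not part of the repository scripts.

import sent2vec

# Loading the ~17.2 GB wiki-bigrams model takes a while and a lot of RAM.
model = sent2vec.Sent2vecModel()
model.load_model("data/wikibigrams.bin")

# Embed an arbitrary sentence just to confirm the setup.
embedding = model.embed_sentence("a gymnast performs a routine on the balance beam")
print(embedding.shape)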

Step 6: download the observers' predictions

Download observers_predictions.zip and save it in the data/ directory.
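To sanity-check the download, you can peek at one of the prediction files. The snippet below assumes the archive has been extracted so that the paths used in Step 7 (e.g., data/observers/ob1_transformer_i3d/ucf101.json) exist; it makes no assumption about the exact JSON schema.

import json
from itertools import islice

with open("data/observers/ob1_transformer_i3d/ucf101.json") as f:
    predictions = json.load(f)

# Print a few entries to see how video identifiers map to generated sentences.
if isinstance(predictions, dict):
    for video_id, prediction in islice(predictions.items(), 3):
        print(video_id, "->", prediction)
else:  # e.g., a list of records
    print(predictions[:3])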

Step 7: run an experiment

Some example commands:

  • UCF101
python run_experiment.py --dataset_name ucf101 \
    --dataset_class_list ./data/ucf101/texts/ucf_classes.txt \
    --dataset_train_test_class_list ./data/ucf101/texts/truezsl_splits.json \
    --dataset_descriptions_dir ./data/ucf101_texts/ \
    --embedder_for_semantic_preprocessing paraphrase-distilroberta-base-v2 \
    --zsar_embedder_name paraphrase-distilroberta-base-v2 \
    --min_words_per_sentence_description 15 \
    --max_sentences_per_class 10 \
    --k_neighbors 1 \
    --metric cosine \
    --observer_paths data/observers/ob1_transformer_i3d/ucf101.json data/observers/ob2_bmt_i3d_vggish/ucf101.json data/observers/ob3_transformer_i3dvisglove/ucf101.json

Use all observers to reproduce the paper results.

  • HMDB51
python run_experiment.py --dataset_name hmdb51 \
    --dataset_class_list ./data/hmdb51/texts/hmdb_classes.txt \
    --dataset_train_test_class_list ./data/hmdb51/texts/truezsl_splits.json \
    --dataset_descriptions_dir ./data/hmdb51_texts/ \
    --embedder_for_semantic_preprocessing paraphrase-distilroberta-base-v2 \
    --zsar_embedder_name paraphrase-distilroberta-base-v2 \
    --min_words_per_sentence_description 15 \
    --max_sentences_per_class 10 \
    --k_neighbors 1 \
    --metric cosine \
    --observer_paths data/observers/ob1_transformer_i3d/hmdb51.json data/observers/ob1_transformer_i3d/hmdb51_sk16_sp16.json data/observers/ob1_transformer_i3d/hmdb51_sk10_sp10.json

You can see all the parameters with:

python run_experiment.py --help
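To run both datasets back to back, a small wrapper such as the one below can help. It simply replays the two example commands above via subprocess; adjust the observer paths if you use a different set of observers.

import subprocess

COMMON_ARGS = [
    "--embedder_for_semantic_preprocessing", "paraphrase-distilroberta-base-v2",
    "--zsar_embedder_name", "paraphrase-distilroberta-base-v2",
    "--min_words_per_sentence_description", "15",
    "--max_sentences_per_class", "10",
    "--k_neighbors", "1",
    "--metric", "cosine",
]

EXPERIMENTS = {
    "ucf101": {
        "class_list": "./data/ucf101/texts/ucf_classes.txt",
        "splits": "./data/ucf101/texts/truezsl_splits.json",
        "descriptions": "./data/ucf101_texts/",
        "observers": [
            "data/observers/ob1_transformer_i3d/ucf101.json",
            "data/observers/ob2_bmt_i3d_vggish/ucf101.json",
            "data/observers/ob3_transformer_i3dvisglove/ucf101.json",
        ],
    },
    "hmdb51": {
        "class_list": "./data/hmdb51/texts/hmdb_classes.txt",
        "splits": "./data/hmdb51/texts/truezsl_splits.json",
        "descriptions": "./data/hmdb51_texts/",
        "observers": [
            "data/observers/ob1_transformer_i3d/hmdb51.json",
            "data/observers/ob1_transformer_i3d/hmdb51_sk16_sp16.json",
            "data/observers/ob1_transformer_i3d/hmdb51_sk10_sp10.json",
        ],
    },
}

for name, cfg in EXPERIMENTS.items():
    # Same flags as the example commands in Step 7.
    subprocess.run(
        ["python", "run_experiment.py",
         "--dataset_name", name,
         "--dataset_class_list", cfg["class_list"],
         "--dataset_train_test_class_list", cfg["splits"],
         "--dataset_descriptions_dir", cfg["descriptions"],
         *COMMON_ARGS,
         "--observer_paths", *cfg["observers"]],
        check=True,
    )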

Downloads

Main References

Visual GloVe

  • Estevam, V.; Laroca, R.; Pedrini, H.; Menotti, D. Dense Video Captioning using Unsupervised Semantic Information. CoRR, 2021.

MDVC

  • Iashin, V.; Rahtu, E. Multi-Modal Dense Video Captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 958-959.

BMT

  • Iashin, V.; Rahtu, E. A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer. In British Machine Vision Conference (BMVC), 2020.

TruZe protocol

  • Gowda S.N.; Sevilla-Lara L.; Kim K.; Keller F.; Rohrbach M. A New Split for Evaluating True Zero-Shot Action Recognition. In: Bauckhage C., Gall J., Schwing A. (eds) Pattern Recognition. DAGM GCPR, 2021. Lecture Notes in Computer Science, vol 13024. Springer, Cham.

For the complete list, see the paper.

Citation

Our paper is available on arXiv. Please use the following BibTeX entry if you would like to cite our work.

@article{estevam2024tell,
  title = {Tell me what you see: A zero-shot action recognition method based on natural language descriptions},
  author = {V. {Estevam} and R. {Laroca} and H. {Pedrini} and D. {Menotti}},
  year = {2024},
  journal = {Multimedia Tools and Applications},
  volume = {83},
  number = {9},
  pages = {28147-28173},
  doi = {10.1007/s11042-023-16566-5},
  issn = {1573-7721}
}