
Audio-Visual Temporal Action Detection

This repository implements the boundary head proposed in the paper:

Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett, Centre Stage: Centricity-based Audio-Visual Temporal Action Detection, VUA, 2023

This repository is based on ActionFormer.

Citing

If you use this code, please cite:

@INPROCEEDINGS{Hanaudiovisual,
  author={Wang, Hanyuan and Mirmehdi, Majid and Damen, Dima and Perrett, Toby},
  booktitle={The 1st Workshop in Video Understanding and its Applications (VUA 2023)},
  title={Centre Stage: Centricity-based Audio-Visual Temporal Action Detection},
  year={2023}}

Dependencies

  • Python 3.5+
  • PyTorch 1.11
  • CUDA 11.0+
  • GCC 4.9+
  • TensorBoard
  • NumPy 1.11+
  • PyYaml
  • Pandas
  • h5py
  • joblib
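
Before compiling, you can optionally sanity-check the environment. The snippet below is a minimal sketch (not part of the repository) that assumes the standard import names of the packages listed above:

# Hypothetical environment check; adjust for your setup.
import importlib
import torch

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 11), f"PyTorch 1.11+ expected, found {torch.__version__}"
print("CUDA available:", torch.cuda.is_available())

for pkg in ["numpy", "yaml", "pandas", "h5py", "joblib", "tensorboard"]:
    importlib.import_module(pkg)  # raises ImportError if the package is missing
    print(pkg, "OK")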

Compile the NMS code:

cd ./libs/utils
python setup.py install --user
cd ../..
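
The compiled extension implements non-maximum suppression (NMS) over temporal segments. For reference, the pure-Python sketch below (a hypothetical helper, not the repository's API) illustrates the same greedy suppression idea:

# Illustrative greedy 1D NMS over temporal segments (hypothetical helper, not the repo's API).
import numpy as np

def nms_1d(starts, ends, scores, iou_thresh=0.5):
    """Return indices of segments kept after greedy NMS (inputs are numpy arrays)."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # temporal IoU between segment i and all remaining segments
        inter = np.maximum(0.0, np.minimum(ends[i], ends[rest]) - np.maximum(starts[i], starts[rest]))
        union = (ends[i] - starts[i]) + (ends[rest] - starts[rest]) - inter
        iou = inter / np.maximum(union, 1e-8)
        order = rest[iou <= iou_thresh]
    return keep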

Preparation

Datasets and features

You can download the EPIC-KITCHENS-100 annotation repository here. Place it into the folder ./data/visual_feature/epic_kitchens/annotations.

You can download the EPIC-KITCHENS-100 videos here.

You can download the pre-extracted visual features for EPIC-KITCHENS-100 here. Place them into the folder ./data/visual_feature/epic_kitchens/features.

You can extract the audio features for EPIC-KITCHENS-100 by following this repository here. Place the extracted features into the folder ./data/audio_feature/extracted_features_retrain_small_win.

If everything goes well, the ./data folder structure should look like this:

data
├── audio_feature
│   └── extracted_features_retrain_small_win
└── visual_feature
    └── epic_kitchens
        ├── features
        └── annotations
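
You can optionally verify this layout before training. The snippet below is a minimal sketch (not part of the repository):

# Check that the expected ./data directories exist.
from pathlib import Path

expected = [
    "data/audio_feature/extracted_features_retrain_small_win",
    "data/visual_feature/epic_kitchens/features",
    "data/visual_feature/epic_kitchens/annotations",
]
for p in expected:
    print(p, "OK" if Path(p).is_dir() else "MISSING")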

Pretrained models

You can download our pretrained models on EPIC-KITCHENS-100 here.

Training/validation on EPIC-KITCHENS-100

To train the model, run:

python ./train.py ./configs/epic_slowfast.yaml --output reproduce  --loss_act_weight 1.7  --cen_gau_sigma 1.7 --loss_weight_boundary_conf 0.5 
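
The command-line flags override loss-related hyperparameters in ./configs/epic_slowfast.yaml; judging by their names, they weight the action-classification loss, set the Gaussian sigma used for the centricity targets, and weight the boundary-confidence loss, but the authoritative definitions are in the training code. Purely as a hypothetical illustration of how such weights usually enter a multi-task objective:

# Hypothetical sketch only; the actual loss composition is defined in the repository's training code.
def total_loss(loss_act, loss_reg, loss_boundary_conf,
               loss_act_weight=1.7, loss_weight_boundary_conf=0.5):
    # classification + regression + boundary-confidence terms, each scaled by its weight
    return loss_act_weight * loss_act + loss_reg + loss_weight_boundary_conf * loss_boundary_conf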

To validate the model, run:

python ./eval.py ./configs/epic_slowfast.yaml ./ckpt/epic_slowfast_reproduce/name_of_the_best_model 

Results

[RESULTS] Action detection results (action)

|tIoU = 0.10: mAP = 20.88 (%)
|tIoU = 0.20: mAP = 20.13 (%)
|tIoU = 0.30: mAP = 18.92 (%)
|tIoU = 0.40: mAP = 17.51 (%)
|tIoU = 0.50: mAP = 15.03 (%)
Average mAP: 18.50 (%)

[RESULTS] Action detection results (noun)

|tIoU = 0.10: mAP = 26.78 (%)
|tIoU = 0.20: mAP = 25.58 (%)
|tIoU = 0.30: mAP = 23.91 (%)
|tIoU = 0.40: mAP = 21.45 (%)
|tIoU = 0.50: mAP = 17.68 (%)
Average mAP: 23.08 (%)

[RESULTS] Action detection results (verb)

|tIoU = 0.10: mAP = 24.11 (%)
|tIoU = 0.20: mAP = 23.00 (%)
|tIoU = 0.30: mAP = 21.66 (%)
|tIoU = 0.40: mAP = 20.16 (%)
|tIoU = 0.50: mAP = 16.57 (%)
Average mAP: 21.10 (%)
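
Each reported average mAP is the arithmetic mean over the five tIoU thresholds (re-averaging the rounded per-tIoU values above can differ from the reported average in the last digit). A quick check:

# Recompute the average mAP from the per-tIoU values listed above.
action = [20.88, 20.13, 18.92, 17.51, 15.03]
noun   = [26.78, 25.58, 23.91, 21.45, 17.68]
verb   = [24.11, 23.00, 21.66, 20.16, 16.57]
for name, vals in [("action", action), ("noun", noun), ("verb", verb)]:
    print(name, round(sum(vals) / len(vals), 2))  # ~18.49, 23.08, 21.10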

Reference

This implementation is based on ActionFormer.