ADPN: Adaptive Dual-branch Promoted Network

Repository for the ACM MM 2023 accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding". The paper tackles the Temporal Sentence Grounding task from an audio-visual collaborative perspective.

Houlun Chen, Xin Wang*, Xiaohan Lan, Hong Chen, Xuguang Duan, Jia Jia*, Wenwu Zhu*.

* Corresponding authors.

[Framework overview figure]

Abstract

Temporal Sentence Grounding aims to retrieve a video moment given a natural language query. Most existing literature merely focuses on visual information in videos without considering the naturally accompanying audio which may contain rich semantics. The few works considering audio simply regard it as an additional modality, overlooking that: i) it's non-trivial to explore consistency and complementarity between audio and visual; ii) such exploration requires handling different levels of information densities and noises in the two modalities. To tackle these challenges, we propose Adaptive Dual-branch Promoted Network (ADPN) to exploit such consistency and complementarity: i) we introduce a dual-branch pipeline capable of jointly training visual-only and audio-visual branches to simultaneously eliminate inter-modal interference; ii) we design Text-Guided Clues Miner (TGCM) to discover crucial locating clues via considering both consistency and complementarity during audio-visual interaction guided by text semantics; iii) we propose a novel curriculum-based denoising optimization strategy, where we adaptively evaluate sample difficulty as a measure of noise intensity in a self-aware fashion. Extensive experiments show the state-of-the-art performance of our method.

Installation

  1. Clone the repository
git clone https://github.com/hlchen23/ADPN-MM.git
  2. Set up the environment

Use Anaconda to build the required environment:

cd ADPN-MM
conda env create -f env.yml
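
Then activate the environment. The environment name is defined in env.yml; "adpn" below is only an assumption, so substitute the actual name if it differs:

conda activate adpn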
  3. Data Preparation

We use GloVe-840B-300d for text embeddings, I3D visual features and PANNs audio features for the Charades-STA dataset, and C3D visual features and VGGish audio features for the ActivityNet Captions dataset. Download the data, create the ADPN-MM/data directory, and arrange the files in the following structure (a quick sanity-check sketch follows the tree).

|--data
|  |--dataset
|     |--activitynet
|     |     |--train.json
|     |     |--val_1.json
|     |     |--val_2.json
|     |--charades
|     |     |--charades_sta_test.txt
|     |     |--charades_sta_train.txt
|     |     |--charades.json
|  |--features
|     |--activitynet
|     |     |--audio
|     |     |     |--VGGish.pickle
|     |     |--c3d_video
|     |     |     |--feature_shapes.json
|     |     |     |--v___c8enCfzqw.npy
|     |     |     |--...(*.npy)
|     |--charades
|     |     |--audio
|     |     |     |--0A8CF.npy
|     |     |     |--...(*.npy)
|     |     |--i3d_video
|     |     |     |--feature_shapes.json
|     |     |     |--0A8CF.npy
|     |     |     |--...(*.npy)
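
After arranging the files, you can run a quick sanity check from the repository root, as in the minimal Python sketch below. The file names are taken from the tree above; the exact feature shapes and the internal structure of the VGGish pickle depend on the downloaded data, so the printed values are only for inspection.

import pickle
import numpy as np

# Charades-STA: per-video I3D visual and PANNs audio features stored as .npy files
vis = np.load("data/features/charades/i3d_video/0A8CF.npy")
aud = np.load("data/features/charades/audio/0A8CF.npy")
print("I3D visual feature shape:", vis.shape)
print("PANNs audio feature shape:", aud.shape)

# ActivityNet Captions: VGGish audio features packed in a single pickle file
with open("data/features/activitynet/audio/VGGish.pickle", "rb") as f:
    vggish = pickle.load(f)
print("Number of VGGish entries:", len(vggish))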
  4. Training

We provide the bash scripts train_ch.sh and train_anet.sh to train our model. For example,

python main.py --task <charades|activitynet> --mode train --gpu_idx <GPU INDEX>
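
A concrete invocation with the placeholders filled in (Charades-STA on GPU 0, chosen here only for illustration) would be:

python main.py --task charades --mode train --gpu_idx 0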
  5. Inference

We provide the bash scripts test_ch.sh and test_anet.sh to run inference. For example,

python main.py --task <charades|activitynet> --mode test --gpu_idx <GPU INDEX> --test_path checkpoint/<charades|activitynet>

Put the config.json and <charades|activitynet>_ckpt.t7 files under the directory specified by --test_path, as illustrated below.
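
For Charades-STA, for example, the checkpoint directory would be laid out as follows (illustrative; the checkpoint file comes from your own training run or a released model):

|--checkpoint
|  |--charades
|  |     |--config.json
|  |     |--charades_ckpt.t7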

See the argument definitions in main.py for more details on the training and testing options.

Acknowledgement

Our implementation builds on the code framework of the VSLNet repository, which allowed us to implement our work quickly. We appreciate this great work.

Cite

If you find this repository helpful to your research, please cite our work.

@inproceedings{chen2023curriculum,
  title={Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding},
  author={Chen, Houlun and Wang, Xin and Lan, Xiaohan and Chen, Hong and Duan, Xuguang and Jia, Jia and Zhu, Wenwu},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={3117--3128},
  year={2023}
}