MAFnet

Multi-level Attention Fusion Network (MAFnet) is a multimodal network that can dynamically fuse visual and audio information for audio-visual event recognition.

We release the training and testing code along with trained models.

MAFnet

The proposed MAFnet architecture is shown below. A video is split into T non-overlapping clips. Audio and visual information are then extracted with two pretrained CNNs: DenseNet [3] for visual features and VGGish [2] for audio features. The clip features are fed into a modality & temporal attention module that builds a global feature containing multimodal and temporal information. This global feature is then used to predict the label of the video. A lateral connection between the visual and audio pathways is added through a FiLM layer.
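To make the description above more concrete, here is a minimal, hypothetical Keras sketch of these ideas: per-clip audio and visual features, a FiLM-style lateral connection from the audio to the visual pathway, and a softmax attention over clips and modalities that builds a single global feature for classification. The dimensions, layer sizes and names are illustrative assumptions, not the exact MAFnet configuration.

```python
# Illustrative sketch only; sizes and layer choices are assumptions.
from keras.layers import (Input, Dense, TimeDistributed, Multiply, Add,
                          Concatenate, Softmax, Lambda)
from keras.models import Model
import keras.backend as K

T, D_AUDIO, D_VISUAL, D, N_CLASSES = 10, 128, 1024, 256, 28  # assumed sizes

audio_in = Input(shape=(T, D_AUDIO), name="audio_features")     # VGGish clips
visual_in = Input(shape=(T, D_VISUAL), name="visual_features")  # DenseNet clips

# FiLM lateral connection: the audio pathway predicts a per-clip scale (gamma)
# and shift (beta) that modulate the visual features.
gamma = TimeDistributed(Dense(D_VISUAL))(audio_in)
beta = TimeDistributed(Dense(D_VISUAL))(audio_in)
visual_mod = Add()([Multiply()([gamma, visual_in]), beta])

# Project both modalities into a shared space of size D.
a = TimeDistributed(Dense(D, activation="relu"))(audio_in)
v = TimeDistributed(Dense(D, activation="relu"))(visual_mod)
clips = Concatenate(axis=1)([a, v])          # (batch, 2T, D): clips x modalities

# Modality & temporal attention: one score per (clip, modality) entry,
# normalised with a softmax, then a weighted sum into a global feature.
scores = TimeDistributed(Dense(1))(clips)    # (batch, 2T, 1)
weights = Softmax(axis=1)(scores)
global_feat = Lambda(lambda x: K.sum(x[0] * x[1], axis=1))([weights, clips])

output = Dense(N_CLASSES, activation="softmax")(global_feat)
model = Model([audio_in, visual_in], output)
model.summary()
```

The attention weights are shared across clips and modalities here, so the network can emphasise whichever clip and modality is most informative for a given video; the released code should be consulted for the actual attention formulation.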

The trained model can be downloaded here.

AVE Dataset & Features

We train and test our model on the AVE Dataset [1].

Audio and visual features can be downloaded here. Audio features are extracted with a VGGish network [2] and visual features are extracted with DenseNet [3].

Scripts for generating audio and visual features are in the feature_extractor folder (feel free to modify and use them to process your own audio-visual data). A rough sketch of the visual side of such a script is shown below.
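The following is a minimal sketch of per-clip visual feature extraction with a pretrained DenseNet, in the spirit of the scripts in feature_extractor. The frame sampling, input size and pooling here are assumptions rather than the repository's exact pipeline; audio features would be obtained analogously with VGGish.

```python
# Illustrative sketch only; frame sampling and pooling are assumptions.
import numpy as np
from PIL import Image
from keras.applications.densenet import DenseNet121, preprocess_input

# Global-average-pooled DenseNet features (one 1024-d vector per frame).
cnn = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def clip_feature(frame_paths):
    """Average DenseNet features over the frames of one clip."""
    frames = []
    for path in frame_paths:
        img = Image.open(path).convert("RGB").resize((224, 224))
        frames.append(preprocess_input(np.asarray(img, dtype=np.float32)))
    feats = cnn.predict(np.stack(frames))  # (n_frames, 1024)
    return feats.mean(axis=0)              # one 1024-d vector per clip
```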

Requirements

  • Python-3.6

  • Tensorflow-gpu-1.15

  • Keras

  • Scikit-learn

  • pillow

  • resampy

  • ffmpeg

  • pickle

Training and testing scripts

To train the network:

python train.py --train

To test the network:

python train.py

References

[1] Tian, Y., Shi, J., Li, B., et al. Audio-visual event localization in unconstrained videos. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247-263.

[2] Hershey, S., Chaudhuri, S., Ellis, D. P. W., et al. CNN architectures for large-scale audio classification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131-135.

[3] Huang, G., Liu, Z., van der Maaten, L., et al. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700-4708.

Acknowledgments

Thanks to CHISTERA IGLU and the European Regional Development Fund (ERDF) for funding.

Audio features are extracted using VGGish and visual features are extracted using DenseNet. We thank the authors for sharing their code.