
Learning to Localize Sound Source in Visual Scenes [CVPR 2018, TPAMI 2020]

This codebase is a re-implementation of the code used in the CVPR 2018 paper Learning to Localize Sound Source in Visual Scenes and the TPAMI paper Learning to Localize Sound Source in Visual Scenes: Analysis and Applications. The original code was written in an early version of TensorFlow, so we re-implemented it in PyTorch for the community.

Getting started

Required packages:

  • tqdm
  • scipy

Preparation

  • Training Data

    • We used 144k samples from the Flickr-SoundNet dataset for training, as mentioned in the paper.
    • Sound features are obtained directly from the SoundNet implementation. We apply average pooling to the output of the "Object" branch of the conv8 layer and use the result as the sound feature in our architecture.
    • To be able to use our dataloader (Sound_Localization_Dataset.py):
      • Each sample folder should contain frames as .jpg files and the audio feature as a .mat file. For details, please refer to Sound_Localization_Dataset.py; a minimal sketch is also given after this list.
        • /hdd/SoundLocalization/dataset/12015590114.mp4/frame1.jpg
        • /hdd/SoundLocalization/dataset/12015590114.mp4/12015590114.mat
  • The Sound Localization Dataset (Annotated Dataset)

    The Sound Localization dataset can be downloaded from the following link:

    https://drive.google.com/open?id=1P93CTiQV71YLZCmBbZA0FvdwFxreydLt

    This dataset contains 5k image-sound pairs and their annotations in XML format. Each XML file contains the annotations of 3 annotators.

    The test_list.txt file lists the id of every pair used for testing.
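
For reference, here is a minimal sketch of a dataset class that follows the folder layout above (one folder per sample, containing frame .jpg images and a single <video_id>.mat sound feature). It is not the actual implementation: the .mat key name ('feature'), the image size, and the frame selection are assumptions for illustration, and the real behaviour is defined in Sound_Localization_Dataset.py.

# Illustrative loader for the folder layout above; not the repo's actual class.
# The .mat key name ('feature'), image size, and frame selection are assumptions.
import os
import glob

import torch
from torch.utils.data import Dataset
from PIL import Image
from scipy.io import loadmat
import torchvision.transforms as T


class SoundLocalizationSketchDataset(Dataset):
    def __init__(self, list_file, image_size=320):
        # list_file: text file with one sample-folder path per line (assumption).
        with open(list_file) as f:
            self.samples = [line.strip() for line in f if line.strip()]
        self.transform = T.Compose([
            T.Resize((image_size, image_size)),
            T.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        folder = self.samples[idx]                       # e.g. .../12015590114.mp4
        video_id = os.path.basename(folder).replace('.mp4', '')

        # Take the first frame; the real loader may sample frames differently.
        frame_path = sorted(glob.glob(os.path.join(folder, '*.jpg')))[0]
        frame = self.transform(Image.open(frame_path).convert('RGB'))

        # Sound feature: average-pooled output of SoundNet conv8 ("Object" branch),
        # stored in <video_id>.mat. The key name 'feature' is an assumption.
        mat = loadmat(os.path.join(folder, video_id + '.mat'))
        sound = torch.from_numpy(mat['feature']).float().squeeze()

        return frame, sound

Wrapping such a dataset in a torch.utils.data.DataLoader then yields (frame, sound feature) batches for training.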

Training

python sound_localization_main.py --dataset_file /hdd3/Old_Machine/sound_localization/semisupervised_train_list.txt \
  --val_dataset_file /hdd3/Old_Machine/sound_localization/supervised_test_list.txt \
  --annotation_path /hdd/Annotations/xml_box_20 --mode train --niter 10 --batchSize 30 --nThreads 8 --validation_on True \
  --validation_freq 1 --display_freq 1 --save_latest_freq 1 --name semisupervised_sound_localization_t1 \
  --optimizer adam --lr_rate 0.0001 --weight_decay 0.0
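
The --annotation_path flag above points to the XML bounding-box annotations used during validation. As a rough illustration of how one annotation file could be read, the sketch below assumes a PASCAL-VOC-like <object>/<bndbox> layout with xmin/ymin/xmax/ymax fields; the actual element names in the xml_box_20 files may differ, so treat it only as a starting point.

# Sketch of reading one annotation XML (one box per annotator). The element
# names (<object>, <bndbox>, xmin/ymin/xmax/ymax) are assumed to follow a
# PASCAL-VOC-like layout and may differ from the actual xml_box_20 files.
import xml.etree.ElementTree as ET


def read_boxes(xml_path):
    """Return a list of (xmin, ymin, xmax, ymax) tuples, one per annotator."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        boxes.append(tuple(int(float(bb.find(tag).text))
                           for tag in ('xmin', 'ymin', 'xmax', 'ymax')))
    return boxes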

Pretrained Model

We provide a pre-trained model for the semi-supervised architecture. Its accuracy is slightly lower than the number reported in the paper because of the re-implementation in a different framework. You can download the model from here.
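
A typical way to restore the downloaded weights in PyTorch is sketched below. The checkpoint file name and the model class name are placeholders, and the sketch assumes the file stores a plain state_dict; adapt both to the actual checkpoint and model definition in this repository.

# Sketch of restoring the pre-trained weights. The file name, the import path,
# and the assumption that the checkpoint is a plain state_dict are placeholders.
import torch

from model import AVModel  # hypothetical import; use this repo's model class

model = AVModel()
state = torch.load('semisupervised_sound_localization.pth', map_location='cpu')
model.load_state_dict(state)
model.eval()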

If you use our code or dataset, please cite the following papers:

@InProceedings{Senocak_2018_CVPR,
author = {Senocak, Arda and Oh, Tae-Hyun and Kim, Junsik and Yang, Ming-Hsuan and So Kweon, In},
title = {Learning to Localize Sound Source in Visual Scenes},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}
@article{Senocak_2020_TPAMI,
title = {Learning to Localize Sound Source in Visual Scenes: Analysis and Applications},
author = {Senocak, Arda and Oh, Tae-Hyun and Kim, Junsik and Yang, Ming-Hsuan and So Kweon, In},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2020},
publisher = {IEEE}
}

Image-sound pairs are collected using the Flickr-SoundNet dataset. Thus, please cite the Yahoo dataset and the SoundNet paper as well.

The dataset and the code must be used for research purposes only.