A PyTorch implementation of "Leveraging Category Information for Single-Frame Visual Sound Source Separation". Authors: Lingyu Zhu and Esa Rahtu. Tampere University, Finland.
Python>=3.5, PyTorch>=0.4.0
# Not released yet; for now, you can train the model on your own dataset by setting up the following:
- Create a folder data/ and place the csv file lists in it. Each row of a csv file has the following format:
audio_path, frames_path, frames count
- Edit the dataset path at line 163 of dataset/music.py so that it points to your data.
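The file lists described above could be generated with a short script along the following lines. This is only a sketch under stated assumptions, not part of the repo: the function name `build_file_list`, the `.mp3` and `.jpg` extensions, the absence of a header row, and the layout in which the frames of `<audio_root>/a/b.mp3` sit in `<frames_root>/a/b/` are all hypothetical and should be adapted to your own data.

```python
import csv
import os


def build_file_list(audio_root, frames_root, out_csv, audio_ext=".mp3"):
    """Write one csv row per clip: audio_path, frames_path, frame count.

    Hypothetical helper; the directory layout and extensions are assumptions.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(audio_root):
        for name in filenames:
            if not name.endswith(audio_ext):
                continue
            audio_path = os.path.join(dirpath, name)
            # Assumed layout: frames for <audio_root>/a/b.mp3 are in <frames_root>/a/b/
            rel = os.path.relpath(audio_path, audio_root)
            frames_path = os.path.join(frames_root, os.path.splitext(rel)[0])
            if not os.path.isdir(frames_path):
                continue  # skip audio files that have no extracted frames
            n_frames = sum(1 for f in os.listdir(frames_path) if f.endswith(".jpg"))
            if n_frames > 0:
                rows.append([audio_path, frames_path, n_frames])
    rows.sort()  # deterministic order regardless of filesystem traversal
    os.makedirs(os.path.dirname(out_csv) or ".", exist_ok=True)
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

For example, `build_file_list("audio", "frames", "data/train.csv")` would write the list under data/; the directory and file names here are placeholders, so check them against what dataset/music.py expects.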
Change the --arch_frame and --arch_sound options in scripts/train_locSep.sh and scripts/eval_locSep.sh to switch to different appearance and sound networks.
# Training the A(Res-50) + S(MV2) model
./scripts/train_locSep.sh
# Training the A(Res-50, att) + S(MV2) model
- The A(Res-50, att) + S(MV2) network is trained on top of the A(Res-50) + S(MV2) model.
- Uncomment the line "CUDA_VISIBLE_DEVICES="0" python -u main_Appearance_att_Sound.py $OPTS" in scripts/train_locSep.sh to start training.
# Training the A(CatEmb) + S(MV2) model
- Uncomment the line "CUDA_VISIBLE_DEVICES="0" python -u main_GCEmb_Sound.py $OPTS" in scripts/train_locSep.sh to start training.
# Adjust the script according to the selected model
./scripts/eval_locSep.sh
[1] Zhu, Lingyu, and Esa Rahtu. "Visually guided sound source separation using cascaded opponent filter network." Proceedings of the Asian Conference on Computer Vision (ACCV). 2020.
[2] Zhao, Hang, et al. "The sound of pixels." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[3] Arandjelovic, Relja, and Andrew Zisserman. "Objects that sound." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
If you find this work useful in your research, please cite:
@inproceedings{zhu2021leveraging,
title={Leveraging Category Information for Single-Frame Visual Sound Source Separation},
author={Zhu, Lingyu and Rahtu, Esa},
booktitle={2021 9th European Workshop on Visual Information Processing (EUVIP)},
pages={1--6},
year={2021},
organization={IEEE}
}
This repo is developed based on Sound-of-Pixels.