Source code for "Visually aligned sound generation via sound-producing motion parsing" (published in Neurocomputing)

Visually aligned sound generation via sound-producing motion parsing [paper]

SPMNet

Overview

We propose to tame visually aligned sound generation by projecting the sound-producing motion into a discriminative temporal visual embedding. This embedding distinguishes transient visual motion from complex background information, which leads to generated sounds with high temporal alignment. We refer to the resulting model as SPMNet.
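
As a rough illustration of the idea, the sketch below projects per-frame visual features into a motion-aware temporal embedding, using a simple frame-difference cue to emphasize transient motion over static background. This is not the released SPMNet implementation; all module names, feature dimensions, and the frame-difference heuristic are our own assumptions.

import torch
import torch.nn as nn

class MotionEmbedding(nn.Module):
    """Hypothetical sketch: per-frame features -> temporal motion embedding."""
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        # Project motion cues into a lower-dimensional embedding space.
        self.proj = nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1)
        # Aggregate the per-step embeddings over time.
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim), e.g. per-frame CNN features.
        # Frame differences highlight transient motion and suppress
        # static background content (an illustrative heuristic).
        motion = frame_feats[:, 1:] - frame_feats[:, :-1]
        x = self.proj(motion.transpose(1, 2)).transpose(1, 2)
        emb, _ = self.temporal(x)  # (batch, time - 1, embed_dim)
        return emb

feats = torch.randn(2, 16, 2048)  # e.g. 16 frames of ResNet features
emb = MotionEmbedding()(feats)
print(emb.shape)  # torch.Size([2, 15, 512])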

News

Code, pre-trained models, and all demos will be released here. Watch this repository for the latest updates.

Demo

Dog

dog_1.mp4
dog_6.mp4

Drum

drum_1.mp4
drum_2.mp4

Firework

firework_1.mp4
firework_2.mp4

Listen to the audio samples in our materials.

Citation

Our paper was accepted by Neurocomputing. Please use this BibTeX entry if you would like to cite our work:

@article{Ma2022VisuallyAS,
  title={Visually Aligned Sound Generation via Sound-Producing Motion Parsing},
  author={Xin Ma and Wei Zhong and Long Ye and Qin Zhang},
  journal={Neurocomputing},
  year={2022}
}

Acknowledgments

We acknowledge the following works:

  • The codebase is built upon the RegNet repo.
  • Thanks to SpecVQGAN for their open-source efforts.