/SoundNet-tensorflow

TensorFlow implementation of "SoundNet".

Primary LanguagePython

SoundNet-tensorflow

TensorFlow implementation of "SoundNet" that learns rich natural sound representations.

Code for paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016

from soundnet

Prerequisites

  • Linux
  • NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
  • Python 2.7 with numpy or Python 3.5
  • Tensorflow 1.0.0 (up to 1.3.0)
  • librosa

Getting Started

  • Clone this repo:
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
  • Pretrained Model

I provide pre-trained models that are ported from soundnet. You can download the 8 layer model here. Please place it as ./models/sound8.npy in your folder.

  • Data

Prepare you input mp3 files and place them under ./data/

Generate a input file txt and place it under ./

./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...

Follow the steps in extract features

  • NOTE

If you found out that some audio with offset value start in FFMPEG will cause a tremendous difference between torch audio and librosa, please convert it with following command.

sox {input.mp3} {output.mp3} trim 0

After this, the result might be much better.

Demo

For demo, you can follow the following steps

i) Download a converted npy file demo.npy and place it under ./data/

ii) To extract multiple features from a pretrained model with torch lua audio loaded sound track: The sound track is equivalent with torch version.

python extract_feat.py -m {start layer number} -x {end layer numbe} -s

Then you can compare the outputs with torch ones.

Feature Extraction

Minimum example

i) Download input file demo.mp3 and place it under ./data/

ii) Prepare a file list in txt format (demo.txt) that includes the input mp3 file(s) and place it under ./

./data/demo.mp3

iii) Then extract features from raw wave in demo.txt: Please put the demo mp3 under ./data/demo.mp3

python extract_feat.py -m {start layer number} -x {end layer numbe} -s -p extract -t demo.txt

More options

To extract multiple features from a pretrained model with downloaded mp3 dataset:

python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer numbe} -s -p extract

e.g. extract layer 4 to layer 17 and save as ./sound_out/tf_fea%02d.npy:

python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract

More details are in:

python extract_feat.py -h

Finetuning

To train from an existing model:

python main.py 

Training

To train from scratch:

python main.py -p train

To extract features:

python main.py -p extract -m {start layer number} -x {end layer numbe} -s

More details are in:

python main.py -h

TODOs

  • Change audio loader to soundnet format
  • Make it compatible to Python 3 format
  • Batch Norm behaviour different from Torch
  • Fix conv8 padding issue in training phase
  • Change all config into tf.app.flags
  • Change dummy distribution of scene and object to useful placeholder
  • Add sound and feature loader from Data section

Known issues

  • Loaded audio length is not consist in torch7 audio and librosa. Here is the issue
  • Training with a short length audio will make conv8 complain about output size would be negative

FAQs

  • Why my loaded sound wave is different from torch7 audio to librosa: Here is my WiKi

Acknowledgments

Code ported from soundnet. And Torch7-Tensorflow loader are from tf_videogan. Thanks for their excellent work!

Author

Hou-Ning Hu / @eborboihuc