Deep Movie Trailer Generation

To extract audio features for all audio files in a folder, use:

python -t ./audio.list -m 25 -x 26 -s -p extract -o ./sound_feature

Download the sound8.npy model. The libraries in this project are prone to version incompatibilities, so running it inside Docker is the safer option.

Generate audio.list using


TensorFlow implementation of "SoundNet" that learns rich natural sound representations.

Code for paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016



Prerequisites

  • Linux
  • NVIDIA GPU + CUDA 8.0 + cuDNN v5.1
  • Python 2.7 with numpy or Python 3.5
  • TensorFlow 1.0.0 (up to 1.3.0)
  • librosa
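Since the version range above is narrow, it can help to check the installed TensorFlow version before running anything. The following is a minimal sketch, not part of the repository; the helper names are hypothetical, and it assumes that "up to 1.3.0" means 1.3.0 is the highest supported release.

```python
def parse_version(text):
    """Turn a string such as '1.3.0' or '1.3.0-rc1' into a tuple like (1, 3, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def tensorflow_version_ok(version, lowest=(1, 0, 0), highest=(1, 3, 0)):
    """Return True if version lies in the range listed above.

    Assumption: both bounds are inclusive, so 1.3.1 is treated as unsupported.
    """
    return lowest <= parse_version(version) <= highest
```

After `import tensorflow as tf`, calling `tensorflow_version_ok(tf.__version__)` tells you whether the installed release falls in that range.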

Getting Started

  • Clone this repo:
git clone
cd SoundNet-tensorflow
  • Pretrained Model

I provide pre-trained models that are ported from soundnet. You can download the 8-layer model here. Please place it at ./models/sound8.npy in your folder.

  • Data

Prepare your input mp3 files and place them under ./data/

Generate an input file list in txt format and place it under ./


Then follow the steps in Feature Extraction.

  • NOTE

If an audio file has a nonzero start offset (as reported by FFmpeg), the waveform loaded by torch audio can differ greatly from the one loaded by librosa. In that case, convert the file with the following command:

sox {input.mp3} {output.mp3} trim 0

After this, the result might be much better.
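To apply the sox command above to a whole folder, a small wrapper is enough. This is a sketch only: the function names are hypothetical, it assumes sox is installed and on the PATH, and it uses Python 3 (`os.makedirs(..., exist_ok=True)`).

```python
import os
import subprocess


def sox_trim_command(src, dst):
    """Build the argument list for: sox {src} {dst} trim 0"""
    return ["sox", src, dst, "trim", "0"]


def convert_folder(in_dir, out_dir, run=subprocess.check_call):
    """Run the sox conversion on every .mp3 in in_dir, writing into out_dir.

    `run` is the callable that executes each command; it defaults to
    subprocess.check_call and can be swapped out for a dry run.
    Returns the list of output paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(in_dir)):
        if not name.lower().endswith(".mp3"):
            continue  # skip anything that is not an mp3 file
        dst = os.path.join(out_dir, name)
        run(sox_trim_command(os.path.join(in_dir, name), dst))
        written.append(dst)
    return written
```

Passing `run=print` shows the commands without executing them, which is a cheap way to check the paths first.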


Demo

For the demo, follow these steps:

i) Download a converted npy file demo.npy and place it under ./data/

ii) Extract multiple features from a pretrained model, using a sound track that was loaded with the Torch (Lua) audio package so that the input is identical to the Torch version's:

python -m {start layer number} -x {end layer number} -s

Then you can compare the outputs with torch ones.

Feature Extraction

Minimum example

i) Download input file demo.mp3 and place it under ./data/

ii) Prepare a file list in txt format (demo.txt) that includes the input mp3 file(s) and place it under ./


iii) Extract features from the raw waveforms listed in demo.txt (make sure the demo mp3 is at ./data/demo.mp3):

python -m {start layer number} -x {end layer number} -s -p extract -t demo.txt
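The file list from step ii) can be written by hand, but for a larger folder a few lines of Python may be easier. This is a sketch under assumptions: the function name is hypothetical, and it assumes the list holds one file path per line. This README does not spell out the expected format (for instance, whether paths should be relative to ./data/), so check it against the loader before relying on it.

```python
import os


def write_file_list(data_dir, list_path, ext=".mp3"):
    """Write the path of every file in data_dir ending in ext to list_path.

    Assumption: one path per line, each prefixed with data_dir.
    Returns the paths in the order they were written.
    """
    names = sorted(n for n in os.listdir(data_dir) if n.lower().endswith(ext))
    paths = [os.path.join(data_dir, n) for n in names]
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For the minimum example, `write_file_list("./data", "./demo.txt")` would produce a demo.txt with a single line, ./data/demo.mp3.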

More options

To extract multiple features from a pretrained model with downloaded mp3 dataset:

python -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract

For example, to extract layer 4 to layer 17 and save the results as ./sound_out/tf_fea%02d.npy:

python -o sound_out -m 4 -x 17 -s -p extract
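Once the features are saved, they can be read back with numpy. The sketch below is not part of the repository. It assumes that %02d in the file name is the layer number and that the range is inclusive, so -m 4 -x 17 yields tf_fea04.npy through tf_fea17.npy; the function names are hypothetical.

```python
import os


def feature_paths(out_dir, first_layer, last_layer, pattern="tf_fea%02d.npy"):
    """List the expected feature files for layers first_layer..last_layer.

    Assumption: the range is inclusive and %02d is the layer number.
    """
    return [os.path.join(out_dir, pattern % layer)
            for layer in range(first_layer, last_layer + 1)]


def load_features(out_dir, first_layer, last_layer):
    """Load every feature file that exists into a {layer: array} dict."""
    import numpy as np  # imported here so feature_paths works without numpy

    features = {}
    layers = range(first_layer, last_layer + 1)
    for layer, path in zip(layers, feature_paths(out_dir, first_layer, last_layer)):
        if os.path.exists(path):
            features[layer] = np.load(path)
    return features
```

For example, `load_features("./sound_out", 4, 17)` returns whichever of those layers were actually written, which makes missing files easy to spot.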

More details are in:

python -h


Training

To train from an existing model:



To train from scratch:

python -p train

To extract features:

python -p extract -m {start layer number} -x {end layer number} -s

More details are in:

python -h


TODO

  • Change audio loader to soundnet format
  • Make it compatible with Python 3
  • Batch Norm behaviour different from Torch
  • Fix conv8 padding issue in training phase
  • Change all config into
  • Change dummy distribution of scene and object to useful placeholder
  • Add sound and feature loader from Data section

Known issues

  • Loaded audio length is not consistent between torch7 audio and librosa. Here is the issue
  • Training on very short audio makes conv8 fail because its output size would be negative
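The second issue follows from the usual output-length formula for a strided convolution, floor((n + 2p - k) / s) + 1: once the input to a layer is shorter than its kernel, the result is zero or negative. The sketch below illustrates this with a kernel of 8, a stride of 2 and no padding. Those values are assumptions for illustration; check the model definition for the real conv8 settings.

```python
def conv1d_output_length(input_length, kernel, stride, padding=0):
    """Standard output length of a 1-D convolution along one axis."""
    return (input_length + 2 * padding - kernel) // stride + 1


# Assumed conv8 settings (kernel 8, stride 2, no padding), for illustration only.
for length in (16, 8, 6, 4):
    out = conv1d_output_length(length, kernel=8, stride=2)
    status = "ok" if out > 0 else "too short"
    print("input %2d -> output %2d (%s)" % (length, out, status))
```

Because every earlier layer also shrinks the signal, the minimum usable clip length is much larger than the conv8 kernel itself. Applying the same function layer by layer gives an estimate of that minimum.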


FAQ

  • Why is my loaded sound wave different between torch7 audio and librosa? Here is my wiki


The code is ported from soundnet, and the Torch7-to-TensorFlow loader is from tf_videogan. Thanks for their excellent work!


Hou-Ning Hu / @eborboihuc