Large-Scale Bird Sound Classification using Convolutional Neural Networks

By Stefan Kahl, Thomas Wilhelm-Stein, Hussein Hussein, Holger Klinck, Danny Kowerko, Marc Ritter, and Maximilian Eibl

Introduction

Code repo for our submission to the LifeCLEF bird identification task BirdCLEF2017. This is a refined version of our original code described in the working notes. We added comments and removed some of the boilerplate code. If you have any questions or problems running the scripts, don't hesitate to contact us.

Contact: Stefan Kahl, Technische Universität Chemnitz, Media Informatics

E-Mail: stefan.kahl@informatik.tu-chemnitz.de

This project is licensed under the terms of the MIT license.

Please cite the paper in your publications if it helps your research.

@article{kahl2017large,
  title={Large-Scale Bird Sound Classification using Convolutional Neural Networks},
  author={Kahl, Stefan and Wilhelm-Stein, Thomas and Hussein, Hussein and Klinck, Holger and 
  Kowerko, Danny and Ritter, Marc and Eibl, Maximilian},
  journal={Working notes of CLEF},
  year={2017}
}

You can download our working notes here: TUCMI BirdCLEF Working Notes PDF

Installation

This is a Theano/Lasagne implementation in Python for the identification of hundreds of bird species based on their vocalizations. The code was tested on Ubuntu 14.04 LTS but should work with other distributions as well.

First, you need to install Python 2.7 and the CUDA-Toolkit for GPU acceleration. After that, you can clone the project and run the Python package tool PIP to install most of the relevant dependencies:

git clone https://github.com/kahst/BirdCLEF2017.git
cd BirdCLEF2017
sudo pip install -r requirements.txt

We use OpenCV for image processing; you can install the cv2 package for Python running this command:

sudo apt-get install python-opencv

Finally, you need to install Theano and Lasagne:

sudo pip install -r https://raw.githubusercontent.com/Lasagne/Lasagne/master/requirements.txt
sudo pip install https://github.com/Lasagne/Lasagne/archive/master.zip

You should follow the Lasagne installation instructions for more details: http://lasagne.readthedocs.io/en/latest/user/installation.html

Training

In order to reproduce our 2017 submission, you need to download the BirdCLEF2017 training data. Nonetheless, the code will work with all other bird recordings obtained from sources like Xeno-Canto, eBird or others.

Dataset

The training script uses subfolders as class names, so you should provide the following directory structure:

dataset   
¦
+---species1
¦   ¦   file011.wav
¦   ¦   file012.wav
¦   ¦   ...
¦   
+---species2
¦   ¦   file021.wav
¦   ¦   file022.wav
¦   ¦   ...
¦    
+---...

For the BirdCLEF2017 training data, you can use the script birdCLEF_sort_data.py, providing the paths of the WAV and XML directories. We used a 10% local validation split for testing. You can separate files for testing from the training data by running the script birdCLEF_validation_split.py and specifying the path of your sorted dataset.
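The idea of a per-species validation split can be sketched as follows. This is only an illustration and not the code of birdCLEF_validation_split.py; the function name, its arguments, and the fixed seed are assumptions made for this example.

```python
import os
import random
import shutil


def split_validation(dataset_dir, val_dir, val_fraction=0.1, seed=1337):
    """Move a fraction of each species' WAV files from dataset_dir to val_dir.

    Hypothetical helper: names and defaults are assumptions, not the repo's API.
    Returns a dict mapping species name to the number of files moved.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    moved = {}
    for species in sorted(os.listdir(dataset_dir)):
        src = os.path.join(dataset_dir, species)
        if not os.path.isdir(src):
            continue  # only subfolders count as classes
        files = sorted(f for f in os.listdir(src) if f.lower().endswith('.wav'))
        rng.shuffle(files)
        n_val = int(len(files) * val_fraction)
        dst = os.path.join(val_dir, species)
        if n_val > 0 and not os.path.isdir(dst):
            os.makedirs(dst)
        for name in files[:n_val]:
            shutil.move(os.path.join(src, name), os.path.join(dst, name))
        moved[species] = n_val
    return moved
```

Splitting per species rather than over the whole file list keeps rare classes represented in both sets, as long as they have enough recordings.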

Extracting Spectrograms

We decided to use magnitude spectrograms with a resolution of 512x256 pixels, which represent five-second chunks of audio signal. You can generate spectrograms for your sorted dataset with the script birdCLEF_spec.py. You can switch to different settings for the spectrograms or change the heuristic which separates bird sounds from noise by editing the file.
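As a rough illustration of what a magnitude spectrogram of a five-second chunk is, here is a minimal NumPy sketch. It is not the implementation in birdCLEF_spec.py: the window length, hop size, Hann window, and function names are assumptions, and the signal/noise heuristic and the scaling to 512x256 pixels are left out.

```python
import numpy as np


def split_chunks(signal, rate, seconds=5.0):
    """Cut a 1-D signal into consecutive chunks of `seconds` length.

    A trailing remainder shorter than one chunk is dropped (an assumption).
    """
    size = int(rate * seconds)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def magnitude_spectrogram(chunk, win_len=512, hop=256):
    """Return |STFT| of a chunk with shape (win_len // 2 + 1, n_frames)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(chunk) - win_len) // hop
    # Slice overlapping frames and apply the window to each of them
    frames = np.stack([chunk[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # Real FFT per frame; transpose so that rows are frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The resulting array could then be normalized and resized to the desired input size, for example with OpenCV.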

Extracting spectrograms might take a while. Eventually, you should end up with a directory containing subfolders named after bird species, which we will use as class names during training.

Note: You need to remove the “noise” folder containing rejected spectrograms without bird sounds from the training data.

Training a Model

You can train your own model using either the BirdCLEF2017 training data or your own sound recordings. All you need are spectrograms of the recordings. Before training, you should review the following settings, which you can find in the birdCLEF_train.py file:

DATASET_PATH containing the spectrograms (subfolders as class names)
NOISE_PATH containing noise samples (you can download the samples we used here or select your own from the noise folder with rejected spectrograms)
MAX_SAMPLES_PER_CLASS limiting the number of spectrograms per bird species (Default = 1500)
VAL_SPLIT which defines the fraction of spectrograms you would like to use for monitoring the training process (Default = 0.05, i.e. 5%)
MULTI_LABEL for softmax outputs (False) or sigmoid outputs (True); True also activates batch augmentation (see working notes for details)
IM_SIZE defining the size of input images, spectrograms will be scaled accordingly (Default = 512x256 pixels)
IM_AUGMENTATION selecting different techniques for dataset augmentation
MODEL_TYPE being either 1, 2 or 3 depending on the model architecture you would like to train (see working notes for details)
BATCH_SIZE defining the number of images per batch; reduce batch size to fit model in less GPU memory (Default = 128)
LEARNING_RATE for scheduling the learning rate; use LR_DESCENT = True for linear interpolation and LR_DESCENT = False for steps
PRETRAINED_MODEL if you want to use a pickle file of a previously trained model; set LOAD_OUTPUT_LAYER = False if model output size differs (you can download a pre-trained model here)
SNAPSHOT_EPOCHS in order to continuously save model snapshots; select [-1] to save after every epoch; the best model params will be saved automatically after training
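The difference between the two learning rate modes can be sketched like this. The format of the schedule (a dict mapping epochs to learning rates) and the function name are assumptions made for this example; check birdCLEF_train.py for the actual format of LEARNING_RATE.

```python
def learning_rate_at(epoch, schedule, lr_descent=True):
    """Return the learning rate for `epoch`.

    schedule: dict {epoch: learning_rate} (hypothetical format).
    lr_descent=True interpolates linearly between neighboring entries,
    lr_descent=False keeps each value until the next entry is reached.
    """
    points = sorted(schedule.items())
    # Clamp to the first and last entry outside the scheduled range
    if epoch <= points[0][0]:
        return points[0][1]
    if epoch >= points[-1][0]:
        return points[-1][1]
    for (e0, lr0), (e1, lr1) in zip(points, points[1:]):
        if e0 <= epoch < e1:
            if not lr_descent:
                return lr0  # step schedule
            t = (epoch - e0) / float(e1 - e0)
            return lr0 + t * (lr1 - lr0)  # linear interpolation


# Example: decay from 0.01 in epoch 1 to 0.001 in epoch 11
schedule = {1: 0.01, 11: 0.001}
for epoch in (1, 6, 11):
    print(epoch, learning_rate_at(epoch, schedule, True),
          learning_rate_at(epoch, schedule, False))
```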

There are a lot more options - most should be self-explanatory. If you have any questions regarding the settings or the training process, feel free to contact us.

Note: In order to keep results reproducible with fixed random seeds, you need to update your .theanorc file with the following lines:

[dnn.conv]
algo_bwd_filter=deterministic
algo_bwd_data=deterministic

Depending on your GPU, training will take a while. Training with all 940k specs from the BirdCLEF2017 training data takes 1-2 hours per epoch on an NVIDIA P6000 and 2-4 hours on an NVIDIA TitanX, depending on the type of model architecture used.

Evaluation

After training, you can test models and evaluate them on your local validation split. To do so, you need to adjust the settings in birdCLEF_evaluate.py to match your model hyperparameters. The most important settings are:

IM_SIZE, MODEL_TYPE, MULTI_LABEL and TRAINED_MODEL where you specify the pickle file of your pre-trained model and the corresponding model architecture
BATCH_SIZE to fit forward pass into GPU memory and most importantly to generate timestamps for soundscapes if set to 1
SPEC_LENGTH and SPEC_OVERLAP to test different numbers of specs per sound file; when processing soundscapes, you should set SPEC_OVERLAP = 0
INCLUDE_BG_SPECIES is rather BirdCLEF specific and lets you decide if you want to evaluate background species, too. (Note: Not all background species listed in the xml files are relevant, therefore you should set ONLY_BG_SPECIES_FROM_CLASSES = True)
You can save predictions for ensemble pooling if you specify an EVAL_ID and set SAVE_PREDICTION = True. Next time you start the script, you can load this prediction and it will be merged with the prediction of the current model
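To get a feeling for how SPEC_LENGTH and SPEC_OVERLAP from the list above determine the number of specs per sound file, consider the following sketch. It assumes both values are given in seconds and that a trailing remainder shorter than one chunk is skipped; the evaluation script may handle the end of a file differently, and the function name is hypothetical.

```python
def spec_offsets(duration, spec_length=5.0, spec_overlap=4.0):
    """Start times in seconds of all full-length chunks of a recording."""
    step = spec_length - spec_overlap
    if step <= 0:
        raise ValueError('spec_overlap must be smaller than spec_length')
    if duration < spec_length:
        return []
    # Number of chunks that fit completely into the recording
    count = int((duration - spec_length) // step) + 1
    return [i * step for i in range(count)]


# A 52-second recording with 4 seconds of overlap yields 48 chunks,
# without overlap it yields 10 chunks
print(len(spec_offsets(52.0, 5.0, 4.0)))
print(len(spec_offsets(52.0, 5.0, 0.0)))
```

Without overlap, each chunk covers its own time span, which is why non-overlapping chunks are the natural choice when timestamps for soundscapes are needed.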

If you use data other than the BirdCLEF training data, you will have to adjust your ground truth before you can evaluate. You should do this by implementing the getGroundThruth() function of the script.

Testing

If you want to make predictions for a single, unlabeled WAV file, you can call the script birdCLEF_test.py from the command line. You can use this script as is; no training is required. Simply follow these steps:

1. Download pre-trained model:

sh model/fetch_model.sh

2. Execute script:

python birdCLEF_test.py --filename 'dataset/example_file.wav' --overlap 4 --results 5 --confidence 0.01

If everything goes well, you should see an output just like this:

HANDLING IMPORTS... DONE!
BUILDING MODEL TYPE 1 ...
	FINAL POOL OUT SHAPE: (None, 1024, 4, 8)
...DONE!
MODEL HAS 8 WEIGHTED LAYERS
MODEL HAS 24221980 PARAMS
IMPORTING MODEL PARAMS... DONE!
COMPILING THEANO TEST FUNCTION... DONE! ( 2 s )
TESTING: dataset/example_file.wav
TOP PREDICTION(S):
	Asthenes moreirae gshgib 96 %
PREDICTION FOR 48 SPECS TOOK 1310 ms ( 27 ms/spec )

Note: You do not need to specify values for overlap, results and confidence – those are optional. If you want to change the pre-trained model, model type or multi label settings, you need to edit the script itself.
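How the optional results and confidence values might act on the predictions can be sketched as follows. This is an assumption-laden illustration, not the logic of birdCLEF_test.py: it assumes that the per-spec scores are pooled by averaging, and the function name and arguments are hypothetical.

```python
import numpy as np


def top_predictions(spec_probs, class_names, results=5, confidence=0.01):
    """Pool per-spectrogram scores and return the best classes.

    spec_probs: array-like of shape (n_specs, n_classes).
    Returns at most `results` (name, score) pairs whose pooled score
    is at least `confidence`, sorted by descending score.
    """
    # Mean pooling over all specs of one file (assumption)
    pooled = np.mean(np.asarray(spec_probs, dtype=float), axis=0)
    order = np.argsort(pooled)[::-1][:results]
    return [(class_names[i], float(pooled[i]))
            for i in order if pooled[i] >= confidence]
```

A higher confidence threshold shortens the list of predictions, while results only caps its length.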

This repo might not suffice for real-world applications, but you should be able to adapt the testing script to your specific needs.

We will keep this repo updated and will provide more testing functionality in the future.