
Whistle Detection

Note: This repository is still under heavy development

About

This repository implements whistle sound classification using a ResNet neural network architecture. We provide a simple Python API and scripts for training.

Dataset

We make use of this dataset, provided by the RoboCup Standard Platform League (SPL).

The dataset contains 10 sound recordings from RoboCup matches as .wav files. Additionally, a .json file is provided, containing labels for each sound file. The labels describe the start and end sample indices of each whistle contained in the recordings (for each audio channel).

This is an example of the .json format:

{
  "audioFiles":
  [
    {
      "path": "recording1.wav",
      "channels":
      [
        {
          "completelyLabeled": true,
          "whistleLabels":
          [
            {
              "start": 3957000,
              "end": 4020000
            },
            {
              "start": 4611000,
              "end": 4700000
            },
            {...}
          ]
        },
        {...}
      ]
    },
    {...}
  ]
}
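
As an illustration, such a label file can be read with a few lines of Python. This is only a sketch; the file name labels.json is an assumption, and the key names follow the example above.

import json

with open("labels.json") as f:  # assumed file name of the label file
    labels = json.load(f)

for audio_file in labels["audioFiles"]:
    print(audio_file["path"])
    for channel_index, channel in enumerate(audio_file["channels"]):
        if not channel["completelyLabeled"]:
            continue  # skip channels that are not fully labeled
        for whistle in channel["whistleLabels"]:
            print(f"  channel {channel_index}: whistle from sample {whistle['start']} to {whistle['end']}")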

Additions and Modifications

TODO

Installation

Installing from source using poetry

For normal training and evaluation we recommend installing the package from source using a poetry virtual environment.

git clone https://github.com/bit-bots/whistle_detection.git  # Downloads the source code
cd whistle_detection  # Change current directory
pip3 install poetry --user  # Install the poetry tool
poetry install  # Use poetry to install our dependencies in an encapsulated environment

You need to enter the virtual environment by running poetry shell in this directory before running any of the following commands; alternatively, prefix them with poetry run.

Usage

TODO

API

TODO

Training

TODO

Approach

During the implementation phase, we discussed various approaches, which are described below.

Our Current Approach

Our approach was inspired by some of these Jupyter notebooks.

It takes audio chunks of a configurable length and sample rate and produces a binary classification: does the chunk contain a whistle sound?

For our training process, we use the dataset described above. All recordings are resampled to a configurable sample rate by our dataloader; a rough sketch of this step is shown below.
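
The following is only an illustrative sketch of the resampling step, assuming torchaudio is used; the target rate of 16 kHz and the file name are example values, not fixed choices of our dataloader.

import torchaudio

waveform, original_rate = torchaudio.load("recording1.wav")  # load one dataset recording
resample = torchaudio.transforms.Resample(orig_freq=original_rate, new_freq=16000)  # configurable target rate
waveform = resample(waveform)  # waveform now has a sample rate of 16 kHz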

Training epochs are not deterministic: random waveform chunks (of configurable duration) are selected on the fly from the recording files, with the same number of chunks drawn from the whole dataset in each epoch. This selection can be seeded to provide determinism between runs. A train/test split can be configured, which splits the list of recording files such that the first n% of the shuffled files are always (persistently over all epochs) used for training, while the remaining files are used for validation.

The waveform chunks are transformed to their Mel spectrogram (image) and fed as input to the neural network. As the network architecture, we use a ResNet18 whose last layer is replaced by a linear layer that outputs a single value between 0 and 1. Using a configurable confidence threshold, this value can be converted to a binary value representing whether a whistle sound was detected during the audio chunk. A minimal sketch of this pipeline is shown below.
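
The following is a minimal sketch of this classification pipeline, not our actual implementation. It assumes torch, torchaudio, and torchvision are available, and all parameter values (sample rate, number of Mel bins, confidence threshold) are illustrative.

import torch
import torch.nn as nn
import torchaudio
import torchvision

SAMPLE_RATE = 16000  # configurable sample rate (example value)
THRESHOLD = 0.5      # configurable confidence threshold (example value)

# Waveform chunk -> Mel spectrogram "image"
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_fft=1024, n_mels=64)
amplitude_to_db = torchaudio.transforms.AmplitudeToDB()

# ResNet18 with the last layer replaced by a linear layer with a single output
model = torchvision.models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)
model.eval()

def contains_whistle(chunk: torch.Tensor) -> bool:
    # chunk has shape (1, num_samples), i.e. one mono waveform chunk
    spectrogram = amplitude_to_db(mel_spectrogram(chunk))  # shape (1, n_mels, time)
    image = spectrogram.unsqueeze(0).repeat(1, 3, 1, 1)    # ResNet18 expects 3 input channels
    with torch.no_grad():
        confidence = torch.sigmoid(model(image)).item()    # single value between 0 and 1
    return confidence > THRESHOLD                          # binary whistle / no-whistle decision

# Example: classify a one-second chunk of silence
print(contains_whistle(torch.zeros(1, SAMPLE_RATE)))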

Other Approaches

We discarded the following approaches, as they would likely have problems with aliasing, and using neural networks would probably not have a large benefit over conventional audio classification.

  • Using short audio snippets (~10 ms) for binary whistle classification, clumping consecutive whistle detections together to recover the length of the sound
    • Inputting the samples directly
    • Inputting a Fourier transform (FT) of the snippet
    • Using a neural network or "simple" peak detection

We skipped the following approach because of its increased complexity; we wanted to try our current approach first.

  • Using longer audio snippets (~100 ms - 1 s) and a UNet-like architecture to detect the "position" of the whistle sound within the snippet

Continuous Integration (CI)