This project uses an older version of PyTorch, installed with pip:

    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
The generate-data-set.ipynb notebook contains the following steps:
- Store all the training data in DATA_PATH
- Use the same base name for the audio file and its label file, for example:
- Audio file name: "audio-1.wav"
- Label file name: "audio-1.json"
- For more details about the audio and label formats, see Audio-Spectrum-Labeling-Toolset
- Read in the audio data
- Convert the audio into a spectrogram (a minimal sketch follows this list)
- Slice the spectrogram into overlapping windows
- TIME_SCALE scales the time span of each window. For example, if TIME_SCALE is 1, each window is a square: its width equals the height of audioSpectrogram.
- For species that make longer sounds, increase TIME_SCALE so that their calls fit inside the window.
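The read/convert steps can be sketched with torchaudio (installed above). This is a minimal sketch, not the notebook's exact code; the STFT parameters (`n_fft`, `hop_length`) are illustrative assumptions.

```python
# Minimal sketch of "read in audio" and "convert to spectrogram" using torchaudio.
import torchaudio

waveform, sample_rate = torchaudio.load("audio-1.wav")   # (channels, samples)
to_spectrogram = torchaudio.transforms.Spectrogram(
    n_fft=512, hop_length=256, power=2.0                 # assumed STFT settings
)
spectrogram = to_spectrogram(waveform)                   # (channels, freq_bins, frames)
audioSpectrogram = spectrogram[0]                        # keep a single channel
```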
The spectrogram is enhanced using the following methods (a code sketch follows the list):
- Normalize the spectrogram to 0-1:
- Subtract the minimum value from the entire spectrogram so that the minimum becomes 0.
- Divide the entire spectrogram by its maximum value so that the maximum becomes 1.
- Enhance the spectrogram using
$f(x) = 1 - (1-x)^{\text{ENHANCE-FACTOR}}$
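A minimal NumPy sketch of the normalization and enhancement, continuing from the spectrogram computed above; the value of ENHANCE_FACTOR here is only a placeholder.

```python
import numpy as np

ENHANCE_FACTOR = 3                         # placeholder; the notebook defines the real value

spec = audioSpectrogram.numpy()            # spectrogram from the previous sketch
spec = spec - spec.min()                   # shift so the minimum is 0
spec = spec / spec.max()                   # scale so the maximum is 1
spec = 1 - (1 - spec) ** ENHANCE_FACTOR    # f(x) = 1 - (1 - x)^ENHANCE_FACTOR
```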
After enhancement, a sliding window is applied to the normalized spectrogram. Neighboring windows overlap by half of the window size.
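A sketch of the windowing step under the assumptions above: the window width is TIME_SCALE times the spectrogram height, and the stride is half a window.

```python
# spec: the enhanced spectrogram from the previous sketch (freq_bins x frames)
TIME_SCALE = 1                               # 1 gives square windows

height = spec.shape[0]                       # number of frequency bins
window_width = int(TIME_SCALE * height)
stride = window_width // 2                   # neighboring windows overlap by half

windows = [
    spec[:, start:start + window_width]
    for start in range(0, spec.shape[1] - window_width + 1, stride)
]
```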
(Figure: examples of the generated windows.)
For each window, the selective search algorithm is applied to generate region proposals.
For each proposal and ground-truth label, the overlapping area ratio is calculated. Proposals that overlap a label well enough are chosen as positive samples; the rest serve as negative samples.
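A sketch of how a proposal could be labeled by its overlap with a ground-truth box. The ratio is written here as intersection-over-union and the threshold is a placeholder; the notebook's exact ratio definition and threshold may differ.

```python
# Boxes are (x, y, w, h) in pixels of the spectrogram window.
def overlap_ratio(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

POSITIVE_THRESHOLD = 0.5                               # placeholder threshold

proposal = (10, 5, 30, 30)                             # example proposal box
ground_truth = (12, 8, 28, 27)                         # example labeled box
is_positive = overlap_ratio(proposal, ground_truth) >= POSITIVE_THRESHOLD
```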
After the data set is generated, bounding-box-classifier.ipynb uses it to train a CNN model that classifies each region proposal and suggests a bounding-box offset.
Each proposal in the training set is given a positive/negative label, and BCELoss is applied as the classification loss. The bounding-box offset is trained as well, so the neural network can suggest an offset for each proposal.
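A sketch of how the two objectives can be combined, assuming BCELoss for the label (as stated above) and a SmoothL1 regression loss for the offsets; the regression loss, the equal weighting, and the restriction to positive proposals are assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
reg = nn.SmoothL1Loss()                              # assumed offset regression loss

# Placeholder batch standing in for model outputs and dataset targets.
score = torch.rand(8, 1)                             # predicted probability of "positive"
offset = torch.randn(8, 4)                           # predicted (dx, dy, dw, dh)
label = torch.randint(0, 2, (8, 1)).float()          # 1.0 for positive proposals
target_offset = torch.randn(8, 4)                    # ground-truth offsets

cls_loss = bce(score, label)
pos = label.squeeze(1) == 1
box_loss = reg(offset[pos], target_offset[pos]) if pos.any() else torch.tensor(0.0)
loss = cls_loss + box_loss                           # assumed equal weighting
```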
The entire RCNN takes 2 inputs:
- The normalized spectrogram image in 1x64x64 format
- Metadata in 1x3 format:
- Start frequency of the window
- End frequency of the window
- Time span of the window
The neural network computes the following outputs:
- A score indicating whether the proposal is good (positive) or bad (negative).
- The suggested offset for the window.
For example, the following output means the neural network thinks the proposal is bad.
It also suggests moving the window anchor 2.76 pixels left and 0.16 pixels down, expanding the window width by 4.97 pixels, and expanding the window height by about 1 pixel for a better fit.

    (array([[0.36625865]], dtype=float32),
     array([[-2.7618032 , -0.16448733,  4.9751573 ,  1.0600426 ]],
           dtype=float32))
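A PyTorch sketch of a network with this interface: a 1x64x64 spectrogram window plus 1x3 metadata in, a sigmoid proposal score and four offsets out. The layer sizes and structure are illustrative assumptions, not the notebook's exact architecture.

```python
import torch
import torch.nn as nn

class ProposalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                                     # 1 x 64 x 64 in
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 16 x 32 x 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 x 16 x 16
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 16 * 16 + 3, 128), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())   # proposal score
        self.regressor = nn.Linear(128, 4)                                 # (dx, dy, dw, dh)

    def forward(self, image, meta):
        x = torch.cat([self.features(image), meta], dim=1)   # fuse image features and metadata
        x = self.fc(x)
        return self.classifier(x), self.regressor(x)

model = ProposalNet()
score, offset = model(torch.rand(1, 1, 64, 64), torch.rand(1, 3))
```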
With 110 seconds of audio, the data generator produces about 1000 positive samples and 3000 negative samples.
Training for 100 epochs reaches 80% accuracy on the test set.
Although the classification accuracy is only 80%, the final detections can still be good, since the neural network also suggests an offset for each proposal.
The audio-detection.ipynb notebook contains the following steps (a code sketch follows the list):
- Load audio and RCNN model.
- Generate selective search proposals.
- Use RCNN model to filter the proposals.
- Apply the suggested offset to the proposals for final detection.
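A sketch of that pipeline under a few assumptions: selective search comes from opencv-contrib (`cv2.ximgproc`), `model` is the ProposalNet sketch above (or the trained model loaded from disk), the score threshold is 0.5, and the metadata mapping treats the vertical axis as frequency. The notebook's actual implementation may differ on all of these.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

window = np.random.rand(64, 256).astype(np.float32)        # stand-in for a normalized spectrogram window

# Selective search proposals (requires opencv-contrib-python).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(cv2.cvtColor((window * 255).astype(np.uint8), cv2.COLOR_GRAY2BGR))
ss.switchToSelectiveSearchFast()
proposals = ss.process()                                   # rows of (x, y, w, h)

detections = []
for (x, y, w, h) in proposals:
    region = torch.from_numpy(window[y:y + h, x:x + w])[None, None]               # 1 x 1 x h x w
    region = F.interpolate(region, size=(64, 64), mode="bilinear", align_corners=False)
    meta = torch.tensor([[float(y), float(y + h), float(w)]])                     # assumed (start freq, end freq, time span)
    score, offset = model(region, meta)                    # model: the trained classifier/regressor
    if score.item() > 0.5:                                 # assumed score threshold
        dx, dy, dw, dh = offset[0].tolist()
        detections.append((x + dx, y + dy, w + dw, h + dh))   # apply the suggested offset
```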