Urban-Sound-Classification

Sound Classification using Neural Networks

References

Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal processing letters, 24(3), 279-283. https://doi.org/10.1109/LSP.2017.2657381

Dataset

The UrbanSound8K dataset contains 8732 labeled sound excerpts (<=4 s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
The 8732 audio files of urban sounds (see description above) are in WAV format. The sampling rate, bit depth, and number of channels match those of the original file uploaded to Freesound (and hence may vary from file to file).
The UrbanSound8K dataset used for model training can be downloaded from the following link: https://urbansounddataset.weebly.com/

Directory Structure

Urban_data_preprocess.ipynb: Pre-processes the data and applies data augmentation

Urban_nn_model.ipynb: Runs 10-fold cross-validation on the original data using a simple NN

Urban_cnn_model.ipynb: Runs 10-fold cross-validation on the original and augmented data using a CNN

Urban_data_generator.ipynb: Contains a data generator that can be used for training the CNN with augmented data

Results

10-fold cross-validation accuracy for NN using original data: 57.43%
10-fold cross-validation accuracy for CNN using original data: 62.61%
10-fold cross-validation accuracy for CNN using augmented data: 63.90%

Future Work

Extend the dataset further by using different augmentation parameters
Apply hyperparameter optimization and test different architectures

Features Extracted

Librosa was used for data preprocessing and feature extraction.

Mel Features

MFCC

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum").

MFCC of a dog bark

Melspectrogram

A mel-scaled spectrogram.

Melspectrogram of a dog bark

Chroma Features

In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into twelve categories) and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music, while being robust to changes in timbre and instrumentation.

Chroma_stft

A chromagram from a waveform or power spectrogram.

Chromagram of a dog bark

Chroma_cqt

Constant-Q chromagram.

Constant-Q chromagram of a dog bark

Chroma_cens

The chroma variant “Chroma Energy Normalized” (CENS).

Chroma cens of a dog bark