This repo contains an implementation of a convolutional neural network (CNN) classifier for the environmental sound classification task. The CNN is trained on audio data from the UrbanSound8K dataset, combined with various feature extraction and augmentation techniques:
- Audio augmentations, as described in paper (1), "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification" by Justin Salamon and Juan Pablo Bello. In particular, the Dynamic Range Compression augmentation is performed using the MUDA library.
- Spectrogram deltas and delta-deltas, as described in paper (2), "Environmental Sound Classification with Convolutional Neural Networks" by Karol J. Piczak (see the sketch after this list).
- Spectrogram image augmentation techniques (mostly found on the web and through experimentation).
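For reference, here is a minimal sketch (not taken from this repo's code) of how the deltas and delta-deltas of paper (2) can be computed with librosa; the file name and parameters are illustrative:

```python
import librosa
import numpy as np

# Hypothetical clip path; UrbanSound8K clips are audio files loadable by librosa
y, sr = librosa.load("dog_bark.wav", sr=22050)

# Log-scaled mel spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# First- and second-order derivatives along the time axis
delta = librosa.feature.delta(log_mel, order=1)
delta2 = librosa.feature.delta(log_mel, order=2)

# Stack as channels of a CNN input: shape (3, n_mels, n_frames)
features = np.stack([log_mel, delta, delta2])
```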
- As a preliminary step, it is strongly advised to set up a conda virtual environment with Python 3.7 (which can be done by following the guide here).
- Then clone this project using:
git clone https://github.com/EmanueleMusumeci/UrbanSound8K-CNN-sound-classification
- Make sure the CUDA toolkit is correctly installed.
- Install pytorch, torchvision and torchaudio using conda:
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
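To quickly verify that the installed build can actually see the GPU, an illustrative check (not part of the repo) is:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True if CUDA 10.2 is set up correctly
```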
- The root folder of this project contains requirements.txt, a file listing all the packages required to run the project. Install them with:
pip install -r requirements.txt
While there shouldn't be any problem installing the MUDA library on Ubuntu, importing the sox library (which happens inside MUDA) might fail on Windows. Please follow this guide to fix this problem on Windows.
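Once MUDA and sox import correctly, the Dynamic Range Compression deformer can be tried out with a minimal sketch like the following (illustrative, not this repo's code; file paths are hypothetical and the preset names are MUDA's built-in Dolby E presets):

```python
import jams
import muda

# Wrap a raw audio file in an (empty) JAMS annotation, as MUDA expects
jam = muda.load_jam_audio(jams.JAMS(), "dog_bark.wav")

# Dynamic Range Compression; requires a working sox installation
drc = muda.deformers.DynamicRangeCompression(preset=["radio", "film standard"])
for i, jam_out in enumerate(drc.transform(jam)):
    # Each output is one compressed variant of the input clip
    muda.save("dog_bark_drc_{}.wav".format(i),
              "dog_bark_drc_{}.jams".format(i),
              jam_out)
```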
- Follow the instructions at the UrbanSound8K dataset webpage to download the dataset
- Create a folder named "data" in the root directory of the project (on Ubuntu, make sure it also has write permissions)
- Extract the downloaded UrbanSound8K.tar.gz archive inside the folder just created
NOTICE: the dataset is divided into folds, each comprising up to 1000 samples (audio clips), totalling about 6 GB of data.
Although this step is optional, it is STRONGLY ADVISED to perform it first, as preprocessing this dataset is computationally heavy (a single epoch might require up to 1 hour), whereas after this step (which usually takes around 2h30m) an epoch requires at most about 2 minutes.
To achieve these results, the dataset is loaded fold by fold and spectrograms are generated for each clip. If preprocessing is applied, preprocessed clips and spectrograms are generated for each preprocessing value (at training time, the model then loads a preprocessed clip or spectrogram for a random preprocessing value). All this data is then saved to a memory-mapped file to speed up loading (by a factor of up to 13x).
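To give an idea of the mechanism (an illustrative numpy sketch, not the repo's actual file layout):

```python
import numpy as np

n_clips, n_mels, n_frames = 1000, 128, 128  # hypothetical fold layout

# Compacting: write all spectrograms of a fold to one flat binary file
mm = np.memmap("fold1_spectrograms.dat", dtype=np.float32,
               mode="w+", shape=(n_clips, n_mels, n_frames))
# ... fill mm[i] with each clip's spectrogram, then flush to disk
mm.flush()

# Training: reopen read-only and index lazily, without loading the whole fold
mm = np.memmap("fold1_spectrograms.dat", dtype=np.float32,
               mode="r", shape=(n_clips, n_mels, n_frames))
spec = np.array(mm[42])  # only this entry is actually read from disk
```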
To perform this step, open a terminal in the root directory of the project and run:
python compact_dataset.py -h
to see all possible command-line arguments.
NOTICE: to prevent overwriting previously compacted folds, an attempt to do so will result in the termination of the program. To bypass this failsafe, use the --overwrite_existing_folds argument when launching the script.
If you have issues running Python scripts from the command line, or simply prefer not to, you can manually run the compact_dataset_manual.py script, which compacts the dataset with standard settings.
NOTICE: in this case too, to bypass the overwrite failsafe you'll have to manually set the flag OVERWRITE_EXISTING_FOLDS to True.
Training is performed through the train.py script. Run:
python train.py -h
to see all possible command-line arguments.
A generic use case, where no preprocessing is applied, is the following:
python train.py -n TRAINING_INSTANCE_NAME --dataset_dir "data" --epochs 50
where TRAINING_INSTANCE_NAME is the name we want to give to this training instance. This command launches the training without using pre-compacted data, which is STRONGLY DISCOURAGED (training will take far longer; see the previous section for more info).
If you have previously compacted audio clips, use the --load_compacted_audio argument. It is advisable, however, to compact spectrograms instead (which is done by default when running the compact_dataset.py or compact_dataset_manual.py script) and train using the --load_compacted_spectrograms argument (compacted spectrograms allow the maximum speed-up).
To train with only a few samples and without any preprocessing, just to get a "taste" of the model, use the --test_mode argument.
To try a deeper model instead of the one in paper (1), use the --custom_cnn argument.
To apply a preprocessing technique, use the --preprocessing_name PREPROCESSING_NAME argument, where PREPROCESSING_NAME can be one of the following:
- PitchShift1
- PitchShift2
- TimeStretch
- DynamicRangeCompression
- BackgroundNoise
(each of them is discussed in report.pdf; a generic sketch of these augmentations follows)
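Here is an illustrative sketch of these audio augmentations with librosa (not the repo's code; the parameter values below are arbitrary, while the ones actually used are discussed in report.pdf):

```python
import librosa
import numpy as np

y, sr = librosa.load("dog_bark.wav", sr=22050)  # hypothetical clip

shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # PitchShift
stretched = librosa.effects.time_stretch(y, rate=1.2)       # TimeStretch

# BackgroundNoise: mix in another recording at a fixed gain
noise, _ = librosa.load("street_noise.wav", sr=sr)          # hypothetical
noisy = y + 0.1 * np.resize(noise, y.shape)
```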
If a model with the same TRAINING_INSTANCE_NAME was previously generated, the script will terminate to avoid overwriting it. To disable this failsafe, use the --disable_model_overwrite_protection argument.
The --compute_deltas and --compute_delta_deltas arguments will apply the preprocessing techniques described in paper (2).
The --apply_spectrogram_image_background_noise and --apply_spectrogram_image_shift arguments will instead apply the spectrogram image augmentation techniques (sketched below).
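An illustrative sketch of what these two image-level augmentations amount to (the function name and values are hypothetical, not the repo's):

```python
import numpy as np

def augment_spectrogram_image(spec, noise_std=0.05, max_shift=10):
    # Add Gaussian "background noise" directly to the spectrogram image
    noisy = spec + np.random.normal(0.0, noise_std, size=spec.shape)
    # Randomly (circularly) shift the image along the time axis
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(noisy, shift, axis=1)

spec = np.random.rand(128, 128).astype(np.float32)  # stand-in spectrogram
augmented = augment_spectrogram_image(spec)
```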
To tune regularization, use --dropout_probability DROPOUT_PROBABILITY or --weight_decay WEIGHT_DECAY.
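Illustratively (not the repo's code), these two arguments correspond to standard PyTorch regularization knobs:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Dropout(p=0.5),   # p would come from DROPOUT_PROBABILITY
    torch.nn.Linear(128, 10),
)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,         # L2 penalty, from WEIGHT_DECAY
)
```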
NOTICE: if you don't want to use command-line arguments, you can launch the train_manual.py script, where every setting has to be edited manually.