The dataset is a subset of the AudioSet dataset, consisting of human and emergency sounds. There are 25,834 audio clips in the weakly-labelled train set, 2,380 clips in the strongly-labelled train set, 595 clips in the validation set and 747 clips in the test set. There are a total of 25 classes, as listed below.
Human sounds:
- Applause, 2. Breathing, 3. Chatter, 4. Child_speech_kid_speaking, 5. Cheering, 6. Clapping, 7. Conversation, 8. Cough, 9. Crowd, 10. Crying_sobbing, 11. Female_speech_woman_speaking, 12. Laughter, 13. Male_speech_man_speaking, 14. Run, 15. Screaming, 16. Shout, 17. Sneeze, 18. Walk_footsteps, 19. Whispering
Emergency sounds:
- Air_horn_truck_horn, 2. Car_alarm, 3. Emergency_vehicle, 4. Explosion, 5. Gunshot_gunfire, 6. Siren
If you are interested in downloading this dataset, please use youtube_dl to download the files indicated in metadata/training_set.csv and metadata/testing_set.csv, and store them in dataset/training and dataset/testing respectively:
python download_audioset.py --workspace=$WORKSPACE --data_type=training
python download_audioset.py --workspace=$WORKSPACE --data_type=testing
The pre-trained models can be found in the 'checkpoints' directory and stored according to the parameters they were trained on.
For example, if the pre-trained model was trained with both strongly and weakly-labelled sets on the Cnn-9layers-Gru-FrameAtt model, with clipwise binary crossentropy loss and mixup applied, and with a batch size of 32, it can be found in the following directory:
checkpoints/main_strong/holdout_fold=1/model_type=Cnn_9layers_Gru_FrameAtt/loss_type=clip_bce/augmentation=mixup/batch_size=32
The models are named best_<feature_type>_<audio_quality>.pth, where the feature type may either be 'logmel' or 'gammatone', and the audio_quality may be '8k', '16k' or '32k'.
The pre-trained models were trained on both the weakly-labelled and strongly-labelled 16k train sets, and evaluated on the strongly-labelled test set. The table below shows the performance of the models:
Model | Data Augmentation | Processing Type | Threshold Type | Segment-based Error Rate | Segment-based F1-score |
---|---|---|---|---|---|
Cnn-9 + Gru + Attention | SpecAugment + Mixup | None | Non-optimised | 0.555 | 0.624 |
Cnn-9 + Gru + Attention | SpecAugment + Mixup | None | Optimised | 0.601 | 0.635 |
Cnn-9 + Gru + Attention | SpecAugment + Mixup | Frame-wise averaging | Non-optimised | 0.557 | 0.618 |
Cnn-9 + Gru + Attention | SpecAugment + Mixup | Frame-wise averaging | Optimised | 0.574 | 0.639 |
Cnn-9 + Transformer + Attention | SpecAugment + Mixup | None | Non-optimised | 0.552 | 0.628 |
Cnn-9 + Transformer + Attention | SpecAugment + Mixup | None | Optimised | 0.561 | 0.648 |
Cnn-9 + Transformer + Attention | SpecAugment + Mixup | Frame-wise averaging | Non-optimised | 0.549 | 0.622 |
Cnn-9 + Transformer + Attention | SpecAugment + Mixup | Frame-wise averaging | Optimised | 0.550 | 0.647 |
If you would like to test the performance of the pre-trained model on the test set yourself, please follow the instructions in the Dataset section to download the test set and then run the following commands:
python pytorch/main_strong.py inference_prob_overlap --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='mixup' --batch_size=32 --feature_type='logmel' --cuda --sed_thresholds --audio_16k
Instructions (more details can be found in run.sh):
-
Upload the audio clips you would like to process to a folder
- The path to this directory will be your $INPUT_DIR
- If your audio files are not in .wav format, the prediction system will automatically convert them from their current format to .wav
-
Run the following command:
python pytorch/predict.py predict --input_dir=$INPUT_DIR --workspace=$WORKSPACE --filename='main_strong' --holdout_fold 1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='mixup' --batch_size=32 --feature_type='logmel' --cuda --sample_duration=5 --overlap --overlap_value=1 --sed_thresholds
-
The prediction output is saved in the 'predict_results' directory in the following xml format:
Note:
-
Integration with Automatic Speech Recognition (ASR)
- If you would like to run the system integrated with DeepSpeech ASR, please run the following command:
python pytorch/predict.py predict_asr --input_dir=$INPUT_DIR --workspace=$WORKSPACE --filename='main_strong' --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='mixup' --batch_size=32 --feature_type='logmel' --cuda --sample_duration=5 --overlap --overlap_value=1 --sed_thresholds --language='eng'
- In this system, ASR is activated whenever Male_speech_man_speaking, Female_speech_woman_speaking and Child_speech_kid_speaking events are detected.
-
Weak vs Combined System
- If you would like to run the purely weakly-labelled system, use --filename='main'.
- If you would like to run the combined weakly-labelled and strongly-labelled system, use --filename='main_strong'.
-
Data Augmentation Types
- If you would like to run the system with only mixup applied, use --augmentation='mixup'.
- If you would like to run the system with both timeshift and mixup applied, use --augmentation='timeshift_mixup'.
- The following list indicates the possible augmentation techniques you may apply:
['none', 'spec_augment', 'timeshift', 'mixup', 'timeshift_mixup', 'specaugment_timeshift_mixup', 'specaugment_mixup', 'specaugment_timeshift']
-
Audio Quality Selection
- If you would like to run the system trained on the 8k or 16k dataset, add a '--audio_8k' or 'audio_16k' tag, respectively.
- If you would like to run the system trained on the 32k dataset, no additional tag is required.
Instructions (more details can be found in run.sh):
- Do include '--audio_16k' or '--audio_8k' if you would like to train on the 16k or 8k dataset, respectively.
- No additional tag is needed if you would like to train on the 32k dataset.
-
Prepare data for training by packing the waveforms to hdf5
Pack test set:
python utils/features.py pack_audio_files_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --feature_type='logmel' --data_type='testing' --audio_16k
Pack validation set:
python utils/features.py pack_audio_files_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --feature_type='logmel' --data_type='strong_validation' --audio_16k
If only doing weak training:
python utils/features.py pack_audio_files_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --feature_type='logmel' --data_type='training' --audio_16k
If doing combined weak and strong training:
python utils/features.py pack_audio_files_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --feature_type='logmel' --data_type='weak_training' --audio_16k
python utils/features.py pack_audio_files_to_hdf5 --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --feature_type='logmel' --data_type='strong_training' --audio_16k
-
Commence training
If only doing weak training:
python pytorch/main.py train --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='specaugment_mixup' --learning_rate=1e-3 --batch_size=32 --resume_iteration=0 --stop_iteration=50000 --feature_type='logmel' --cuda
If doing combined weak and strong training:
python pytorch/main_strong.py train --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='specaugment_mixup' --learning_rate=1e-3 --batch_size=32 --resume_iteration=0 --stop_iteration=50000 --feature_type='logmel' --cuda --audio_16k
-
Optimize thresholds (optional but strongly recommended)
python utils/optimize_thresholds.py optimize_sed_thresholds --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --filename='main_strong' --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='specaugment_mixup' --feature_type='logmel' --batch_size=32 --audio_16k
-
Evaluate and Calculate metrics:
If using optimized thresholds:
python pytorch/main_strong.py inference_prob_overlap --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='specaugment_mixup' --batch_size=32 --feature_type='logmel' --cuda --audio_16k --sed_thresholds
If not using optimized thresholds:
python pytorch/main_strong.py inference_prob_overlap --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --holdout_fold=1 --model_type=$MODEL_TYPE --loss_type='clip_bce' --augmentation='specaugment_mixup' --batch_size=32 --feature_type='logmel' --cuda --audio_16k