Author: Irene Martin, Tampere University.
- Clone the repository from GitHub.
- Install the requirements:

      pip install -r requirements.txt

- Extract features from the previously downloaded audio files:

      python extract_features.py

- Run the task-specific application with default settings:

      python task4b.py

  or:

      ./task4b.py
To set up an Anaconda environment for the system, use the following:

    conda create --name dcase-t4b python=3.6
    conda activate dcase-t4b
    conda install numpy
    conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
    pip install torchinfo
    pip install librosa
    pip install pandas
    pip install scikit-learn
    pip install sed_eval
    pip install dcase_util
    pip install sed_scores_eval
This is the baseline system for Subtask B of the Sound Event Detection task (Task 4) in the Detection and Classification of Acoustic Scenes and Events 2023 (DCASE2023) challenge. The system is intended to provide a simple entry-level approach that gives reasonable results. The baseline system is built on the dcase_util toolbox (version >= 0.2.16).
Participants can build their own systems by extending the provided baseline system. The system is deliberately simple: it does not handle dataset download, but simple feature extraction code is provided. The baseline system is a good starting point, especially for entry-level researchers, to become familiar with the soft-label scenario, where labels are numbers between 0 and 1.
If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system could make their code more accessible to the community. The DCASE organizers strongly encourage participants to share their code in any form after the challenge.
MAESTRO Real (Multi-Annotator Estimated Strong Labels) is used as the development dataset for this task.
This task is a subtask of Sound Event Detection (Task 4), which provides three kinds of training data: weakly labeled data (without timestamps), strongly labeled data (with timestamps), and unlabeled data. Systems must provide not only the event class but also the temporal localization of events, given that multiple events can be present in an audio recording.
This subtask is concerned with another type of training data:
- Soft labels are provided as numbers between 0 and 1 that characterize the certainty of human annotators about the sound at that specific time.
- The temporal resolution of the provided data is 1 second (due to the annotation procedure).
- Development data is provided with both soft (between 0 and 1) and hard labels.
- Systems will be evaluated against hard labels.
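As a minimal illustration of the soft-label scenario (toy numbers, not taken from the dataset): soft labels can be used directly as targets for a binary cross-entropy loss during training, while evaluation binarizes the predictions into hard labels.

```python
import math

# Soft labels: annotator certainty in [0, 1] at 1-second resolution.
soft = [0.0, 0.7, 1.0]          # targets for three consecutive segments
pred = [0.1, 0.6, 0.9]          # model outputs (sigmoid probabilities)

# Binary cross-entropy accepts soft targets directly, so no
# binarization is needed during training.
bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
           for t, p in zip(soft, pred)) / len(soft)

# For evaluation against hard labels, the outputs are thresholded
# (0.5 is a common default; the F1_{th_op} metric tunes it per class).
hard = [1.0 if p >= 0.5 else 0.0 for p in pred]
print(hard)   # [0.0, 1.0, 1.0]
```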
The task-specific baseline system is implemented in the file model.py.
The system implements a convolutional recurrent neural network (CRNN) based approach, with three CNN layers and one bi-directional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted using a hop length of 200 ms and 64 mel filter banks.
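These frame settings translate into sample counts as follows; a minimal sketch, where the 44.1 kHz sample rate is an assumption for illustration only (check the actual dataset sample rate).

```python
sample_rate = 44100                        # assumed for illustration
frame_length = int(0.400 * sample_rate)    # 400 ms analysis frame
hop_length = int(0.200 * sample_rate)      # 50% hop -> 200 ms
seq_len = 200                              # frames per model input

print(frame_length, hop_length)            # 17640 8820
# One (1, 200, 64) model input covers 200 hops of 0.2 s = 40 s of audio.
print(seq_len * hop_length / sample_rate)  # 40.0
```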
- Analysis frame: 400 ms (50% hop size)
- Mel-band energies (64 bands)
- Input shape: sequence_length * 64
- Architecture:
    - CNN layer #1
        - 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
        - 2D max pooling (pool size: (1, 5)) + Dropout (rate: 20%)
    - CNN layer #2
        - 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
        - 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
    - CNN layer #3
        - 2D Convolutional layer (filters: 32, kernel size: 3) + Batch normalization + ReLU activation
        - 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
    - Permute
    - Bidirectional GRU layer #1
    - Dense layer #1
        - Dense layer (units: 64, activation: Linear)
        - Dropout (rate: 30%)
    - Dense layer #2
        - Dense layer (units: 32, activation: Linear)
- Learning (epochs: 150, batch size: 32, data shuffling between epochs)
    - Optimizer: Adam (learning rate: 0.001)
- Model selection:
    - Approximately 30% of the original training data is assigned to the validation set
    - Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected
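As a rough PyTorch sketch, the architecture above can be reimplemented as follows. This is a hypothetical reconstruction, not the baseline's own model.py: layer sizes follow the printed network summary (whose third convolutional layer has 128 filters), while the "same" padding, reshape, and sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

class CRNNBaseline(nn.Module):
    """Hypothetical sketch of the baseline CRNN, mirroring the network summary."""
    def __init__(self, n_mels=64, n_classes=17):
        super().__init__()
        def block(c_in, c_out, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # "same" padding assumed
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(pool),
                nn.Dropout(0.2),
            )
        self.cnn = nn.Sequential(
            block(1, 128, (1, 5)),    # mel axis: 64 -> 12
            block(128, 128, (1, 2)),  # 12 -> 6
            block(128, 128, (1, 2)),  # 6 -> 3
        )
        self.gru = nn.GRU(input_size=128 * 3, hidden_size=32,
                          batch_first=True, bidirectional=True)
        self.dense1 = nn.Linear(64, 32)
        self.dropout = nn.Dropout(0.3)
        self.dense2 = nn.Linear(32, n_classes)

    def forward(self, x):
        # x: (batch, 1, time, mel)
        x = self.cnn(x)                            # (batch, 128, time, 3)
        x = x.permute(0, 2, 1, 3)                  # (batch, time, 128, 3)
        x = x.reshape(x.shape[0], x.shape[1], -1)  # (batch, time, 384)
        x, _ = self.gru(x)                         # (batch, time, 64)
        x = self.dropout(self.dense1(x))           # (batch, time, 32)
        return torch.sigmoid(self.dense2(x))       # (batch, time, n_classes)

model = CRNNBaseline()
out = model(torch.zeros(2, 1, 200, 64))
n_params = sum(p.numel() for p in model.parameters())
print(out.shape)   # torch.Size([2, 200, 17])
print(n_params)    # 380113, matching the total below
```

With these sizes the parameter count reproduces the 380,113 total reported for the baseline.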
Network summary

    Layer (type)            Output Shape           Param #
    ======================================================
    input_1 (InputLayer)    [(None, 1, 200, 64)]   0
    conv2d                  (None, 128, 200, 64)   1280
    batch_normalization     (None, 128, 200, 64)   256
    max_pooling2d           (None, 128, 200, 12)   0
    dropout                 (None, 128, 200, 12)   0
    conv2d_1                (None, 128, 200, 12)   147584
    batch_normalization_1   (None, 128, 200, 12)   256
    max_pooling2d_1         (None, 128, 200, 6)    0
    dropout_1               (None, 128, 200, 6)    0
    conv2d_2                (None, 128, 200, 6)    147584
    batch_normalization_2   (None, 128, 200, 6)    256
    max_pooling2d_2         (None, 128, 200, 3)    0
    dropout_2               (None, 128, 200, 3)    0
    permute                 (None, 200, 128, 3)    0
    reshape_1               (None, 200, 384)       0
    bidirectional           (None, 200, 64)        80256
    Linear_1                (None, 200, 32)        2080
    Linear_2                (None, 200, 17)        561
    ======================================================
A cross-validation setup is used to evaluate the performance of the baseline system. The micro-averaged scores (ER_m, F1_m) and the macro-averaged score (F1_M) are calculated with the sed-eval toolbox, segment-based with a 1-second segment length. The class-wise macro-averaged score with optimal thresholds (F1_{th_op}) is calculated with sed-scores-eval, also segment-based with 1-second segments.
| | ER_m | F1_m | F1_M | F1_{th_op} |
|----------|--------------------|---------------------|---------------------|---------------------|
| Baseline | 0.487 (+/-0.009) | 70.34% (+/-0.766) | 35.83% (+/-0.660) | 42.87% (+/-0.840) |
Note: the reported system performance is not exactly reproducible due to varying setups, but you should be able to obtain very similar results. The results in the table were obtained by training and testing the model 10 times; the mean and standard deviation over these 10 independent trials are shown.
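To make the metric definitions concrete, here is a pure-Python sketch (not the sed-eval implementation) of micro-averaged segment-based ER and F1 over 1-second segments; the event labels are invented for illustration.

```python
def segment_micro_metrics(reference, estimated):
    """reference/estimated: per-segment sets of active event labels."""
    tp = fp = fn = 0
    subs = ins = dels = 0
    for ref, est in zip(reference, estimated):
        hit = len(ref & est)          # correctly detected events in segment
        tp += hit
        fp += len(est - ref)
        fn += len(ref - est)
        # error-rate decomposition per segment: a missed event paired with
        # a spurious one counts as a substitution
        s = min(len(ref) - hit, len(est) - hit)
        subs += s
        ins += len(est) - hit - s     # remaining spurious events
        dels += len(ref) - hit - s    # remaining missed events
    n_ref = sum(len(r) for r in reference)
    er = (subs + ins + dels) / n_ref
    f1 = 2 * tp / (2 * tp + fp + fn)
    return er, f1

# Three 1-second segments with hypothetical labels:
ref = [{"birds", "traffic"}, {"traffic"}, {"people"}]
est = [{"birds"}, {"traffic", "wind"}, {"people"}]
er, f1 = segment_micro_metrics(ref, est)
print(er, f1)   # 0.5 0.75
```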
- Total params: 380,113
- Trainable params: 380,113
- Non-trainable params: 0
- MACCs (M): 563.741
- Params size (MB): 1.52
For running the CRNN model:
- `extract_features.py`: first extract mel-band features and normalize the data
- `task4b.py`: DCASE2023 baseline for Task 4B
The code is built on the dcase_util toolbox; see its manual for tutorials. The machine learning part of the code is built on PyTorch (v1.10.2).
.
├── task4b.py # Baseline system for subtask B
|
├── utils.py # Common functions shared between tasks
├── data_generator.py # File for the dataset
├── extract_features.py # Functions to extract mel-band features and normalize
├── config.py # Common parameters
├── evaluate.py # Perform model evaluation, sed-eval segment-based
├── model.py # CRNN model implementation
|
├── development_folds # Folder with the splits for 5-fold cross-validation
| - fold1_train.csv
| - fold1_val.csv
| - fold1_test.csv
| - ...
├── metadata
| - development_metadata.csv # File duration information used to calculate sed-scores-eval metrics
| - gt_dev.csv # Ground truth labels (hard-labels)
|
├── development_split.csv # Lists all the files
├── README.md # This file
└── requirements.txt # External module dependencies