This repository contains a simplified version of the codebase to reproduce the results of the CP-JKU submission to the DCASE23 Task 1 "Low-complexity Acoustic Scene Classification" challenge. The implemented model, CP-Mobile, together with the training procedure, scored the top rank in the challenge.
The technical report describing the system can be found here. The official ranking of systems submitted to the challenge is available here.
An extension of the technical report (containing an ablation study and further results) has been submitted to the DCASE Workshop; a link to the paper will be provided soon.
Create a conda environment:
conda env create -f environment.yml
Activate environment:
conda activate cpjku_dcase23
Download the dataset from this location and extract the files.
Adapt the dataset path in the file datasets/dcase23.py by providing the location of the extracted "TAU-urban-acoustic-scenes-2022-mobile-development" folder. Set the path in the following variable:
dataset_dir = None
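For example (the path below is only illustrative and depends on where you extracted the dataset):

```python
# in datasets/dcase23.py -- replace None with your local path (example path, adapt to your machine)
dataset_dir = "/data/TAU-urban-acoustic-scenes-2022-mobile-development"
```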
Run training on the TAU22 dataset:
python run_training.py
The configuration can be adapted using the command line, e.g. changing the probability of the device impulse response augmentation:
python run_training.py --dir_prob=0.4
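Conceptually, the device impulse response (DIR) augmentation convolves a training waveform with a randomly chosen microphone impulse response with probability dir_prob. The sketch below illustrates the idea only; the function and variable names are illustrative and do not mirror the repository code.

```python
import random
import numpy as np
from scipy.signal import fftconvolve


def apply_dir_augmentation(waveform: np.ndarray, impulse_responses: list, dir_prob: float) -> np.ndarray:
    """Convolve the waveform with a random device impulse response with probability dir_prob."""
    if random.random() >= dir_prob:
        return waveform
    ir = random.choice(impulse_responses)  # e.g., one of the MicIRP responses in datasets/dirs
    augmented = fftconvolve(waveform, ir, mode="full")[: len(waveform)]
    # keep the peak level roughly comparable to the original signal
    augmented *= np.max(np.abs(waveform)) / (np.max(np.abs(augmented)) + 1e-9)
    return augmented
```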
The results are automatically logged using Weights & Biases.
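Conceptually, this amounts to something like the following minimal sketch using the plain wandb API; the project and metric names are assumptions, and the training script may wire up logging through a framework logger instead:

```python
import wandb

# project/metric names are illustrative, not necessarily the keys used by run_training.py
run = wandb.init(project="DCASE23_Task1", config={"dir_prob": 0.6})
wandb.log({"epoch": 0, "val_accuracy": 0.0, "val_loss": 0.0})
run.finish()
```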
The models can be quantized using Quantization Aware Training (QAT). For this, the trained model from the previous step is loaded by specifying the Wandb ID and fine-tuned using QAT for 24 epochs. The following command can be used:
python run_qat.py --wandb_id=c0a7nzin
Running the training procedure creates a folder DCASE23_Task1. This folder contains subfolders named after the IDs assigned to the experiments by Weights & Biases. These subfolders contain checkpoints that can be used to load the trained models (see run_qat.py for an example).
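The sketch below outlines both steps, loading a checkpoint from such a subfolder and preparing the model for QAT with PyTorch's eager-mode quantization API. The checkpoint location, its key layout, and the qconfig are assumptions; run_qat.py is the reference implementation.

```python
import glob
import torch
import torch.nn as nn
import torch.ao.quantization as tq


def load_and_prepare_qat(model: nn.Module, wandb_id: str) -> nn.Module:
    """Load a checkpoint written during training and insert fake-quantization modules for QAT."""
    # assumed checkpoint location and layout; see run_qat.py for the actual loading logic
    ckpt_path = glob.glob(f"DCASE23_Task1/{wandb_id}/checkpoints/*.ckpt")[0]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)
    model.load_state_dict(state_dict, strict=False)

    # eager-mode QAT: attach a quantization config and insert observers/fake-quant ops;
    # the returned model is then fine-tuned before converting it to an integer model
    model.train()
    model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    return tq.prepare_qat(model, inplace=False)
```

After fine-tuning, torch.ao.quantization.convert(model.eval()) produces the quantized integer model.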
Default parameters for training on TAU22:
python run_training.py
Fine-tuning and quantizing a model using Quantization Aware Training (a trained model with wandb_id=c0a7nzin is already included in the GitHub repo):
python run_qat.py --wandb_id=c0a7nzin
Check out the results on Weights & Biases.
To train a CP-ResNet teacher model with 128K parameters, run:
python run_cp-resnet_training.py
To fine-tune a pre-trained PaSST teacher model, run:
python run_passt_training.py
The device impulse responses in datasets/dirs are downloaded from MicIRP. All files are shared under a Creative Commons license. All credit goes to MicIRP & Xaudia.com.
We provide the ensembled logits of 3 CP-ResNet [2] models and 3 PaSST [1] transformer models trained on the TAU22 development set train split. The teacher models are trained using the cropped dataset technique introduced in the technical report. The logits are automatically downloaded when running the code and end up in the resources folder.
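The ensembled teacher logits serve as soft targets for knowledge distillation when training the student. As an illustrative sketch only (the temperature, loss weighting, and exact formulation used in the system are described in the technical report), a typical distillation loss combines cross-entropy on the labels with a KL-divergence term towards the teacher logits:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, kd_weight=0.5):
    """Cross-entropy on the labels plus KL divergence to the (ensembled) teacher logits."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - kd_weight) * ce + kd_weight * kd
```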
Based on a request, we also make the pre-trained teacher models available. In total 12 pre-trained models are published:
- 3 x PaSST trained with Freq-MixStyle and DIR: passt_dirfms_1.pt, passt_dirfms_2.pt and passt_dirfms_3.pt
- 3 x PaSST trained with Freq-MixStyle: passt_fms_1.pt, passt_fms_2.pt and passt_fms_3.pt
- 3 x CP-ResNet with 128K parameters trained with Freq-MixStyle and DIR: cpr_128k_dirfms_1.pt, cpr_128k_dirfms_2.pt and cpr_128k_dirfms_3.pt
- 3 x CP-ResNet with 128K parameters trained with Freq-MixStyle: cpr_128k_fms_1.pt, cpr_128k_fms_2.pt and cpr_128k_fms_3.pt
The file run_teacher_validation.py is an example of how to use the teacher models for inference.
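As a rough sketch of what such inference can look like (the helper below is generic: the model constructor, checkpoint contents, and input preprocessing are assumptions, and run_teacher_validation.py remains the authoritative example):

```python
import torch
import torch.nn as nn


def run_teacher_inference(teacher: nn.Module, checkpoint_path: str, spectrograms: torch.Tensor) -> torch.Tensor:
    """Load a pre-trained teacher checkpoint and return class predictions for a batch of spectrograms."""
    # assumption: the .pt files store a plain state_dict; adapt if they contain a full checkpoint dict
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    teacher.load_state_dict(state_dict)
    teacher.eval()
    with torch.no_grad():
        logits = teacher(spectrograms)  # expected shape: (batch, num_classes)
    return logits.argmax(dim=-1)
```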
[1] Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, and Gerhard Widmer, “Efficient Training of Audio Transformers with Patchout,” in Interspeech, 2022.
[2] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer, “Receptive Field Regularization Techniques for Audio Classification and Tagging with Deep Convolutional Neural Networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2021.