In this repository, we publish pre-trained models and code for the ICASSP'25 paper: Effective Pre-Training of Audio Transformers for Sound Event Detection.
In this paper, we propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. For five transformers, we show that this additional pre-training step leads to substantial performance improvements on frame-level downstream tasks. We release all model checkpoints and hope that they will help researchers improve tasks that require high-quality frame-level representations.
The codebase is under construction; the next steps involve:
- Upload all pre-trained checkpoints and model files [DONE]
- Create a script that demonstrates how the pre-trained checkpoints can be loaded and used for inference [DONE]
- Upload an arxiv version of the submitted paper [DONE]
- Add a table outlining the external checkpoints used in this work [DONE]
- Evaluation routine on the AudioSet frame-level annotations [DONE]
- Include a cleaned version of the AudioSet Strong training routine [DONE]
- Upload the ensemble logits for the AudioSet Strong dataset
- Demonstrate how the pre-trained transformer can be used for downstream tasks
- Wrap this repository in an installable python package for easy use
- If needed, create a new environment with python 3.9 and activate it:
conda create -n ptsed python=3.9
conda activate ptsed
- Install a PyTorch build that suits your system. For example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# or for CUDA >= 12.1
pip3 install torch torchvision torchaudio
- Install the requirements:
pip3 install -r requirements.txt
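Optionally, you can verify the installation with a quick check that PyTorch is importable and that a CUDA device is visible:

```python
import torch

# Print the installed PyTorch version and check whether a CUDA device is available.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```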
The script inference.py demonstrates how to load a pre-trained model and run sound event detection on an audio file of arbitrary length.
python inference.py --cuda --model_name="BEATs" --audio_file="test_files/752547__iscence__milan_metro_coming_in_station.wav"
The argument model_name specifies the transformer used for inference; the corresponding pre-trained checkpoint is automatically downloaded and placed in the folder resources.
The argument audio_file specifies the path to a single audio file. One example file is included; more example files can be downloaded from the GitHub release.
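The model produces frame-level class predictions. Purely for illustration, here is a minimal sketch of how frame-level probabilities can be converted into (onset, offset, event) tuples by simple thresholding; the threshold and the assumed 25 fps frame rate (i.e., 250 frames per 10-second clip) are illustrative choices, not values taken from inference.py.

```python
import numpy as np

def frames_to_events(probs, class_names, frames_per_second=25.0, threshold=0.5):
    """Convert a (num_frames, num_classes) probability matrix into (onset, offset, event) tuples."""
    events = []
    active = probs >= threshold  # boolean activity mask per frame and class
    for c, name in enumerate(class_names):
        onset = None
        for t in range(active.shape[0]):
            if active[t, c] and onset is None:
                onset = t  # event starts at this frame
            elif not active[t, c] and onset is not None:
                events.append((onset / frames_per_second, t / frames_per_second, name))
                onset = None
        if onset is not None:  # event still active at the end of the clip
            events.append((onset / frames_per_second, active.shape[0] / frames_per_second, name))
    return events

# Example with random "probabilities" for two dummy classes:
dummy_probs = np.random.rand(250, 2)
print(frames_to_events(dummy_probs, ["Speech", "Dog"])[:5])
```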
The following is a list of checkpoints that we have created and worked with in our paper. For external checkpoints, we provide the download link. "Checkpoint Name" refers to the respective names in our GitHub release.
Model | Pre-Training | Checkpoint Name | Download Link | Reference |
---|---|---|---|---|
BEATs | SSL | BEATs_ssl.pt | here | [1] |
BEATs | Weak | BEATs_weak.pt | here | [1] |
BEATs | Strong | BEATs_strong_1.pt | ours | [1] |
ATST-Frame | SSL | ATST-F_ssl.pt | here | [2] |
ATST-Frame | Weak | ATST-F_weak.pt | here | [2] |
ATST-Frame | Strong | ATST-F_strong_1.pt | ours | [2] |
fPaSST | SSL | fpasst_im.pt | here | [3], [4] |
fPaSST | Weak | fpasst_weak.pt | ours | [3], [4] |
fPaSST | Strong | fpasst_strong_1.pt | ours | [3], [4] |
ASiT | SSL | ASIT_ssl.pt | here | [5] |
ASiT | Weak | ASIT_weak.pt | ours | [5] |
ASiT | Strong | ASIT_strong_1.pt | ours | [5] |
M2D | SSL | M2D_ssl.pt | here | [6] |
M2D | Weak | M2D_weak.pt | here | [6] |
M2D | Strong | M2D_strong_1.pt | ours | [6] |
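inference.py downloads these checkpoints automatically, but if you want to fetch one manually into the resources folder, a sketch like the following works. The URL below is a placeholder; replace it with the actual asset link from the GitHub release.

```python
import os
import torch

# Placeholder URL -- substitute the asset URL from the GitHub release.
url = "https://github.com/<user>/<repo>/releases/download/<tag>/BEATs_strong_1.pt"

os.makedirs("resources", exist_ok=True)
torch.hub.download_url_to_file(url, os.path.join("resources", "BEATs_strong_1.pt"))
```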
Activate conda environment:
conda activate ptsed
Install additional requirements for training:
CFLAGS='-O3 -march=native' pip install https://github.com/f0k/minimp3py/archive/master.zip
pip install -r train_requirements.txt
- Follow the steps described here to obtain AudioSet, encoded as mp3 files and packed into HDF5 format.
You will end up with a directory containing three HDF5 files:
- balanced_train_segments_mp3.hdf
- unbalanced_train_segments_mp3.hdf
- eval_segments_mp3.hdf
- We use the Huggingface datasets API for fast and memory-efficient loading of the dataset. The hf_dataset_gen/audioset_strong.py file takes the dataset from Step 1 and converts it into a Huggingface dataset.
Adapt the paths in hf_dataset_gen/audioset_strong.py marked as TODOs (2x: hdf5 path and target path for HF dataset).
- Create the Huggingface dataset:
cd hf_dataset_gen
python audioset_strong.py
- The dataset location is specified via an environment variable, which must be set whenever the dataset is accessed for training or evaluation. For example, in our case, the Huggingface dataset path is:
/share/hel/datasets/HF_datasets/local/audioset_strong_official
and we therefore set the following environment variable:
export HF_DATASETS_CACHE=/share/hel/datasets/HF_datasets/cache/
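To quickly verify that the generated dataset is readable, you can load and inspect it with the datasets API. This is only a sketch and assumes the generation script stores the dataset in a format that load_from_disk understands; adapt it if your setup differs.

```python
import os

# Use the same cache location as for training/evaluation (set before importing datasets).
os.environ["HF_DATASETS_CACHE"] = "/share/hel/datasets/HF_datasets/cache/"

from datasets import load_from_disk

# Assumption: the dataset was written to the target path configured in hf_dataset_gen/audioset_strong.py.
ds = load_from_disk("/share/hel/datasets/HF_datasets/local/audioset_strong_official")
print(ds)  # splits, number of rows, and column names
```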
The size of a pseudo label file is around 50 GB, resulting from a pseudo label matrix (250 timesteps x 447 class predictions) stored for each file in the dataset (~100k recordings). We are currently figuring out how to best share these. Stay tuned!
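As a rough sanity check of the ~50 GB figure, assuming the pseudo labels are stored as float32:

```python
# 250 timesteps x 447 classes x 4 bytes (float32) per recording, ~100k recordings
per_file_bytes = 250 * 447 * 4                 # 447,000 bytes, i.e. ~0.45 MB per recording
total_gb = per_file_bytes * 100_000 / 1e9      # ~44.7 GB, consistent with the ~50 GB figure
print(f"{per_file_bytes} bytes per file, ~{total_gb:.1f} GB total")
```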
Example: train ATST-F (pre-trained on AudioSet Weak) with an RNN on top, using the balanced sampler and a wavmix augmentation probability of 1.0:
python ex_audioset_strong.py --model_name=ATST-F --seq_model_type=rnn --use_balanced_sampler --pretrained=weak --wavmix_p=1.0
Check out the results: https://api.wandb.ai/links/cp_tobi/tphswm5k
Evaluate the AudioSet Strong pre-trained checkpoint of ATST-F:
python ex_audioset_strong.py --model_name=ATST-F --pretrained=strong
If everything is set up correctly, this should give a val/psds1_macro_averaged of around 46.1.
This section presents the main results reported in the paper, along with additional ablation studies, including teacher model performances, comparisons of different sequence models, and evaluations using the DESED baseline system setup. The additional ablation studies were requested by ICASSP'25 reviewers.
- All results represent averages over three independent runs.
- For AudioSet Strong, we employ the threshold-independent PSDS1 [7] metric to ensure fine-grained temporal evaluation.
- For the Li et al. [2] row, we reproduced their AudioSet Strong training pipeline.
- Alongside the Proposed Pipeline, we include ablation studies for three settings: no KD, no RNN in teacher models, and no pre-training on AudioSet Weak (no Step 2).
| | ATST-F | BEATs | fPaSST | M2D | ASiT |
|---|---|---|---|---|---|
Li et al. [2] | 40.9 | 36.5 | 38.7 | 36.9 | 37.0 |
Proposed Pipeline | 45.8 | 46.5 | 45.4 | 46.3 | 46.2 |
-- without KD | 41.8 | 44.1 | 40.7 | 41.1 | 40.9 |
-- without RNN | 45.7 | 45.8 | 45.3 | 46.0 | 46.1 |
-- without Step 2 | 45.7 | 46.3 | 45.2 | 44.9 | 46.2 |
Conclusions:
- The significant performance gap to [2] stems mainly from our three design choices (KD, RNNs, Step 2), but also from improvements in the AudioSet Strong training routine, including balanced sampling and aggressive data augmentation.
- Knowledge Distillation (KD) has the most substantial impact, underlining the effectiveness of the ensemble-KD approach.
- RNNs in teacher models and pre-training on AudioSet Weak offer modest improvements but are justified due to their low additional cost. Notably, they do not increase student model complexity, and AudioSet Weak checkpoints are publicly available for most transformers.
- The table below shows teacher model results for each transformer.
- Column Avg. Ind. represents the average performance across all single models in the row.
- Column Ensemble represents the performance of the ensemble consisting of all models in the respective row.
| | ATST-F | BEATs | fPaSST | M2D | ASiT | Avg. Ind. | Ensemble |
|---|---|---|---|---|---|---|---|
Proposed Teacher Pipeline | 43.3 | 45.8 | 43.3 | 44.1 | 43.3 | 44.9 | 47.1 |
-- without RNN | 41.8 | 44.1 | 40.7 | 41.1 | 40.9 | 41.7 | 46.2 |
-- without Step 2 | 43.5 | 34.4 | 40.9 | 43.8 | 43.2 | 41.2 | 46.5 |
Conclusions:
- Ensemble Performance: The Ensemble column reflects the teacher ensemble performances utilized for Knowledge Distillation (KD) in the table above (a minimal KD sketch follows after these conclusions).
- Impact of RNNs and Step 2: Incorporating RNNs and Step 2 (AudioSet Weak pre-training) notably enhances single-model teacher performance, with the exception of ATST-F without Step 2.
- Benefits of Ensembling: While individual model performances show considerable variability (Avg. Ind.), ensembling stabilizes and elevates overall performance, as evidenced by the smaller differences in the Ensemble column.
- BEATs-Specific Insights: BEATs excels in the Proposed Teacher Pipeline and without RNN settings but underperforms in the without Step 2 configuration. This discrepancy may be attributed to its unique SSL pre-training routine and longer sequence length (resulting from more tokens being extracted from the input).
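To make the ensemble-KD setup referenced above concrete, here is a minimal sketch, assuming the teachers' frame-level probabilities are simply averaged to form soft targets and combined with the hard AudioSet Strong labels via BCE losses. The exact aggregation and loss weighting used in the paper may differ, and kd_weight is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_probs):
    """Average frame-level probabilities of several teachers.

    teacher_probs: list of tensors, each of shape (batch, frames, classes) with values in [0, 1].
    """
    return torch.stack(teacher_probs, dim=0).mean(dim=0)

def kd_loss(student_logits, teacher_probs, strong_labels, kd_weight=0.5):
    """Blend the hard-label loss with a soft-target distillation loss (illustrative weighting)."""
    hard = F.binary_cross_entropy_with_logits(student_logits, strong_labels)
    soft = F.binary_cross_entropy_with_logits(student_logits, ensemble_soft_targets(teacher_probs))
    return (1.0 - kd_weight) * hard + kd_weight * soft
```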
- The use of an additional sequence model on top of the AudioSet Weak pre-trained transformers stems from our hypothesis that adding capacity specifically for temporally-strong predictions can enhance performance.
- The table below shows teacher model performances for various sequence models added on top of the transformers before training on AudioSet Strong. The paper uses BiGRUs (RNN) as they deliver the best performance; a minimal sketch of such a head is given after the conclusions below.
- We investigated four different sequence models: BiGRU (RNN), Transformer (TF), self-attention (ATT), and MAMBA.
- We varied the inner dimension (dim) and the number of layers (<Model Type>:<#layers>; e.g., TF:2 means two Transformer layers were added on top of the pre-trained transformer).
- Setup Notes:
- Ablations were performed using ATST-F due to its computational efficiency.
- Performance without a sequence model was 41.8 PSDS1.
- Removing the top Transformer layers, which may overfit to AudioSet Weak labels, decreased performance.
- For MAMBA, only a single layer was feasible due to memory constraints.
PSDS1 | RNN:1 | RNN:2 | RNN:3 | TF:1 | TF:2 | TF:3 | ATT:1 | ATT:2 | ATT:3 | MAMBA:1 |
---|---|---|---|---|---|---|---|---|---|---|
dim=256 | 8.72 | 3.76 | 3.10 | 34.25 | 34.62 | 34.05 | 40.08 | 39.70 | 39.55 | 40.27 |
dim=512 | 40.62 | 7.26 | 0.12 | 40.41 | 41.11 | 40.30 | 41.78 | 41.91 | 41.95 | 41.25 |
dim=1024 | 42.74 | 42.75 | 43.00 | 42.69 | 42.22 | 42.20 | 42.44 | 42.45 | 42.08 | 41.97 |
dim=2048 | 43.41 | 43.43 | 42.66 | 42.90 | 38.94 | 42.90 | 41.58 | 41.59 | 41.42 | 41.72 |
Conclusions:
- Best model type: The highest performance was achieved with 2 BiGRU layers, followed by Transformer, Self-Attention, and MAMBA. All sequence models improved performance compared to using no additional sequence model, though MAMBA's gains were marginal.
- Inner Dimension: Larger inner dimensions consistently led to better performance across all sequence models. Significant improvements required dimensions ≥1024, while smaller dimensions (e.g., 256) often degraded performance, with severe failures for BiGRU. We believe that large inner dimensions are essential due to the high number of classes (447) in AudioSet Strong.
- Number of layers: Performance was relatively insensitive to the number of layers for most sequence models, with optimal results often achieved with just 1–2 layers.
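For concreteness, below is a minimal sketch of a BiGRU head of the kind evaluated above, applied to the frame-level embeddings of a pre-trained transformer. The embedding dimension and the mapping of the hidden size to the "dim" axis of the table are illustrative assumptions; the number of classes (447) matches AudioSet Strong.

```python
import torch
import torch.nn as nn

class BiGRUHead(nn.Module):
    """Sequence-model head: BiGRU layers followed by a per-frame linear classifier."""

    def __init__(self, embed_dim=768, hidden_dim=1024, num_layers=2, num_classes=447):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frame_embeddings):
        # frame_embeddings: (batch, frames, embed_dim) from the pre-trained transformer
        seq, _ = self.rnn(frame_embeddings)  # (batch, frames, 2 * hidden_dim)
        return self.classifier(seq)          # per-frame logits: (batch, frames, num_classes)

# Example with dummy embeddings for two 10-second clips with 250 frames each:
head = BiGRUHead()
print(head(torch.randn(2, 250, 768)).shape)  # torch.Size([2, 250, 447])
```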
- Three frame-level downstream tasks:
- DCASE 2023 Task 4: Domestic Environment Sound Event Detection (DESED), metric: PSDS 1
- DCASE 2016 Task 2 (DC16-T2), metric: onset F-measure
- MAESTRO 5hr (MAESTRO), metric: onset F-measure
- For DESED, we followed a simplified setup in line with [2], excluding unsupervised data (no mean teacher approach) and an additional CRNN component from the DCASE 2023 Task 4 baseline system. While state-of-the-art approaches such as [4] and [8] leverage advanced techniques (e.g., multi-stage/multi-iteration training, sophisticated data augmentation, and interpolation consistency training), we deliberately avoided these complexities, as the focus is on a precise evaluation of pre-training quality.
Conclusions:
- In-Domain Tasks: The pipeline demonstrates strong, consistent improvements for all transformers on DESED and DC16-T2, showcasing its effectiveness for in-domain tasks.
- Out-of-Domain Task: Results on MAESTRO (piano pitch prediction) are inconclusive. This limitation suggests that the proposed pre-training strategy yields substantial gains only when audio and labels are similar to the AudioSet ontology.
- Simplified DESED Setup: Despite the simplified setup (no CRNN, no unsupervised data), performance remains comparable to the DCASE 2023 Task 4 baseline system.
To complement the simplified DESED setup presented earlier, we provide results for the DCASE 2023 Task 4 baseline system setup for ATST-F and BEATs in the table below. Note that hyperparameters were not extensively tuned, and the data setup may differ slightly from the original baseline.
Model | Checkpoint | Notes | Performance |
---|---|---|---|
ATST-F | Step 1 (SSL) | | 42.7 |
ATST-F | Step 2 (AS weak) | | 47.1 |
ATST-F | Full Pipeline | | 50.4 |
ATST-F | Full Pipeline | dropped 2 TF layers | 51.1 |
BEATs | Step 1 (SSL) | | 39.7 |
BEATs | Step 2 (AS weak) | | 48.1 |
BEATs | Full Pipeline | | 48.6 |
BEATs | Full Pipeline | dropped 2 TF layers | 51.1 |
Conclusions:
- The Full Pipeline substantially improves performance over Step 1 (SSL) and Step 2 (AS Weak) for both ATST-F and BEATs.
- Dropping the last two Transformer layers notably enhances results, suggesting that the final layers may focus on AudioSet Strong label-specific features, while earlier layers provide more general, transferable embeddings that benefit the DESED task. We will conduct further experiments to find out whether dropping Transformer layers generalizes to other tasks or is specific to DESED.
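For illustration, dropping the last Transformer layers can be as simple as the sketch below, assuming the encoder blocks are stored in a ModuleList attribute named blocks; the attribute name varies between the transformers used in this work, so treat this only as a template.

```python
import torch.nn as nn

def drop_last_transformer_layers(model: nn.Module, num_layers: int = 2) -> nn.Module:
    """Remove the last `num_layers` encoder blocks (assumes they live in `model.blocks`)."""
    model.blocks = nn.ModuleList(list(model.blocks)[:-num_layers])
    return model
```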
[1] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proceedings of the International Conference on Machine Learning (ICML), 2023.
[2] X. Li, N. Shao, and X. Li, “Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1336–1351, 2024.
[3] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient training of audio transformers with patchout,” in Proceedings of the Interspeech Conference, 2022.
[4] F. Schmid, P. Primus, T. Morocutti, J. Greif, and G. Widmer, “Multi-iteration multi-stage fine-tuning of transformers for sound event detection with heterogeneous datasets,” in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2024.
[5] S. Atito, M. Awais, W. Wang, M. D. Plumbley, and J. Kittler, “ASiT: Local-global audio spectrogram vision transformer for event classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3684–3693, 2024.
[6] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, M. Yasuda, S. Tsubaki, and K. Imoto, “M2D-CLAP: masked modeling duo meets CLAP for learning general-purpose audio-language representation,” in Proceedings of the Interspeech Conference, 2024.
[7] J. Ebbers, R. Haeb-Umbach, and R. Serizel, “Threshold independent evaluation of sound event detection scores,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[8] N. Shao, X. Li, and X. Li, “Fine-tune the pretrained ATST model for sound event detection,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.