
Reading list for research topics in Sound AI

Reading List for topics in Sound Event Detection


Sound event detection (SED) aims to process a continuous acoustic signal and convert it into symbolic descriptions of the sound events present in the auditory scene. It can be used in a variety of applications, including context-based indexing and retrieval in multimedia databases, unobtrusive monitoring in health care, and surveillance. Since 2017, learning acoustic information from weak annotations has been formulated as a way to utilise the large amount of multimedia data available. This reading list collects papers on sound event detection and Sound AI.
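
As a rough illustration of what learning from weak annotations involves: a weak label only says which events occur somewhere in a clip, not when. A common formulation, covered by the pooling papers in this list, pools frame-level predictions into a clip-level prediction and compares that against the weak label. The sketch below is a minimal NumPy example under stated assumptions: the function names, the array shapes, and the toy probabilities are hypothetical and are not taken from any specific paper here.

```python
import numpy as np


def pool_clip_probability(frame_probs, method="max"):
    """Aggregate frame-level probabilities (frames x classes) into clip-level ones."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    if method == "max":
        # Clip probability is that of the most confident frame.
        return frame_probs.max(axis=0)
    if method == "mean":
        # Every frame contributes equally.
        return frame_probs.mean(axis=0)
    if method == "linear_softmax":
        # Frames are weighted by their own probability: sum(p^2) / sum(p).
        eps = 1e-8  # guards against division by zero when all frames are 0
        return (frame_probs ** 2).sum(axis=0) / (frame_probs.sum(axis=0) + eps)
    raise ValueError(f"unknown pooling method: {method}")


def weak_label_loss(clip_probs, weak_labels):
    """Binary cross-entropy between clip-level probabilities and weak labels."""
    eps = 1e-7
    p = np.clip(np.asarray(clip_probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(weak_labels, dtype=float)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


# Hypothetical clip: 4 frames, 2 event classes (frame-level model outputs).
frame_probs = np.array([
    [0.1, 0.0],
    [0.9, 0.1],
    [0.8, 0.0],
    [0.2, 0.1],
])
# Weak label: class 0 occurs somewhere in the clip, class 1 does not.
weak_labels = np.array([1, 0])

clip_probs = pool_clip_probability(frame_probs, method="max")
loss = weak_label_loss(clip_probs, weak_labels)
```

The frame-level probabilities remain available after training, which is how a model trained only on clip-level labels can still produce event onsets and offsets. The choice of pooling function affects how well that localization works.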

Research papers

Survey papers

Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019

Sound Event Detection: A Tutorial, IEEE Signal Processing Magazine 2021, Volume 38, Issue 5

Automated Audio Captioning: an Overview of Recent Progress and New Challenges, EURASIP Journal on Audio Speech and Music Processing 2022


Learning formulation

Weakly supervised scalable audio content analysis, ICME 2016

Audio Event Detection using Weakly Labeled Data, 24th ACM Multimedia Conference 2016

An approach for self-training audio event detectors using web data, 25th EUSIPCO 2017

A joint detection-classification model for audio tagging of weakly labelled data, ICASSP 2017

Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling, ICASSP 2019

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020

A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition, ICML 2020

Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection, ArXiv 2020

Duration robust weakly supervised sound event detection, ICASSP 2020

SeCoST: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection, ICASSP 2020

Guided Learning for Weakly-Labeled Semi-Supervised Sound Event Detection, ICASSP 2020

Unsupervised Contrastive Learning of Sound Event Representations, ICASSP 2021

Sound Event Detection Based on Curriculum Learning Considering Learning Difficulty of Events, ICASSP 2021

Comparison of Deep Co-Training and Mean-Teacher Approaches for Semi-Supervised Audio Tagging, ICASSP 2021

Enhancing Audio Augmentation Methods with Consistency Learning, ICASSP 2021

Network Architecture

Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks, ICASSP 2017

Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data, NIPS Workshop on Machine Learning for Audio 2017

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network, ICASSP 2018

Orthogonality-Regularized Masked NMF for Learning on Weakly Labeled Audio Data, ICASSP 2018

Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019

Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes, ICASSP 2019

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization, TASLP 2020

DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification, ArXiv 2020

Effective Perturbation based Semi-Supervised Learning Method for Sound Event Detection, INTERSPEECH 2020

Weakly-Supervised Sound Event Detection with Self-Attention, ICASSP 2020

Improving Deep Learning Sound Events Classifiers using Gram Matrix Feature-wise Correlations, ICASSP 2021

An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection, ICASSP 2021

AST: Audio Spectrogram Transformer, INTERSPEECH 2021

Event Specific Attention for Polyphonic Sound Event Detection, INTERSPEECH 2021

Sound Event Detection with Adaptive Frequency Selection, WASPAA 2021

SSAST: Self-Supervised Audio Spectrogram Transformer, AAAI 2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection, ICASSP 2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer, INTERSPEECH 2022

Efficient Training of Audio Transformers with Patchout, INTERSPEECH 2022

BEATs: Audio Pre-Training with Acoustic Tokenizers, ArXiv 2022

Pooling functions

Adaptive Pooling Operators for Weakly Labeled Sound Event Detection, TASLP 2018

Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks, INTERSPEECH 2018

A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling, ICASSP 2019

Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection, INTERSPEECH 2019

Weakly labelled audioset tagging with attention neural networks, TASLP 2019

Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020

A Global-Local Attention Framework for Weakly Labelled Audio Tagging, ICASSP 2021

Missing or noisy audio

Sound event detection and time–frequency segmentation from weakly labelled data, TASLP 2019

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020

Improving weakly supervised sound event detection with self-supervised auxiliary tasks, INTERSPEECH 2021

Data Augmentation

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification, INTERSPEECH 2021

Representation Learning

Contrastive Predictive Coding of Audio with an Adversary, INTERSPEECH 2020

Towards Learning a Universal Non-Semantic Representation of Speech, INTERSPEECH 2020

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection, ICASSP 2021

FRILL: A Non-Semantic Speech Embedding for Mobile Devices, INTERSPEECH 2021

HEAR 2021: Holistic Evaluation of Audio Representations, PMLR: NeurIPS 2021 Competition Track

Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks, ICASSP 2022

Towards Learning Universal Audio Representations, ICASSP 2022

SSAST: Self-Supervised Audio Spectrogram Transformer, AAAI 2022

Multi-Task Learning

A Joint Separation-Classification Model for Sound Event Detection of Weakly Labelled Data, ICASSP 2018

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection, ArXiv 2020

Multi-Task Learning and post processing optimisation for sound event detection, DCASE 2019

Label-efficient audio classification through multitask learning and self-supervision, ICLR 2019

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling, INTERSPEECH 2020

Improving weakly supervised sound event detection with self-supervised auxiliary tasks, INTERSPEECH 2021

Identifying Actions for Sound Event Classification, WASPAA 2021

Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework, INTERSPEECH 2022

Few-Shot


Few-Shot Audio Classification with Attentional Graph Neural Networks, INTERSPEECH 2019

Continual Learning of New Sound Classes Using Generative Replay, WASPAA 2019

Few-Shot Sound Event Detection, ICASSP 2020

Few-Shot Continual Learning for Audio Classification, ICASSP 2021

Unsupervised and Semi-Supervised Few-Shot Acoustic Event Classification, ICASSP 2021

Who Calls the Shots? Rethinking Few-Shot Learning for Audio, WASPAA 2021

A Mutual Learning Framework For Few-Shot Sound Event Detection, ICASSP 2022

Active Few-Shot Learning for Sound Event Detection, INTERSPEECH 2022

Adapting Language-Audio Models as Few-Shot Audio Learners, INTERSPEECH 2023


AudioCLIP: Extending CLIP to Image, Text and Audio, ICASSP 2022

Wav2CLIP: Learning Robust Audio Representations From CLIP, ICASSP 2022

CLAP 👏: Learning Audio Concepts From Natural Language Supervision, ICASSP 2023

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation, ICASSP 2023

Listen, Think, and Understand, ArXiv 2023

Pengi 🐧: An Audio Language Model for Audio Tasks, ArXiv 2023

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action, ArXiv 2023

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, ArXiv 2023

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models, ArXiv 2023

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities, ArXiv 2024

Knowledge Transfer

Transfer learning of weakly labelled audio, WASPAA 2017

Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes, ICASSP 2018

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, TASLP 2020

Do sound event representations generalize to other audio tasks? A case study in audio transfer learning, INTERSPEECH 2021

Polyphonic SED

A first attempt at polyphonic sound event detection using connectionist temporal classification, ICASSP 2017

Polyphonic Sound Event Detection with Weak Labeling, Thesis 2018

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy, DCASE 2019

Evaluation of Post-Processing Algorithms for Polyphonic Sound Event Detection, WASPAA 2019

Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection, TASLP 2020

Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection, ICASSP 2022

Loss function

Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance, ICASSP 2021

Audio and Visual

A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging, ICASSP 2018

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data, IJCAI 2020

Labelling unlabelled videos from scratch with multi-modal self-supervision, NeurIPS 2020

Audio-Visual Event Recognition Through the Lens of Adversary, ICASSP 2021

Taming Visually Guided Sound Generation, BMVC 2021

Learning Audio-Video Modalities from Image Captions, ECCV 2022

UAVM: Towards Unifying Audio and Visual Models, IEEE Signal Processing Letters 2022

Contrastive Audio-Visual Masked Autoencoder, ICLR 2023

Audio Captioning

Automated audio captioning with recurrent neural networks, WASPAA 2017

Audio caption: Listen and tell, ICASSP 2018

AudioCaps: Generating captions for audios in the wild, NAACL 2019

Audio Captioning Based on Combined Audio and Semantic Embeddings, ISM 2020

Clotho: An Audio Captioning Dataset, ICASSP 2020

A Transformer-based Audio Captioning Model with Keyword Estimation, INTERSPEECH 2020

Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events, ICASSP 2021

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags, ICASSP 2021

Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization, ICASSP 2022

Sound Event Detection Guided by Semantic Contexts of Scenes, ICASSP 2022

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning, INTERSPEECH 2022

Audio Retrieval

Audio Retrieval with Natural Language Queries: A Benchmark Study, IEEE Transactions on Multimedia 2022

On Metric Learning for Audio-Text Cross-Modal Retrieval, INTERSPEECH 2022

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval, INTERSPEECH 2022

Audio Retrieval with WavText5K and CLAP Training, ArXiv 2022

Audio Generation

Acoustic Scene Generation with Conditional SampleRNN, ICASSP 2019

Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning, MLSP 2021

Taming Visually Guided Sound Generation, BMVC 2021

Diffsound: Discrete Diffusion Model for Text-to-sound Generation, ArXiv 2022

AudioGen: Textually Guided Audio Generation, ICLR 2023

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models, ArXiv 2023

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, ICML 2023

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, ArXiv 2023

Diverse and Vivid Sound Generation from Text Descriptions, ICASSP 2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation, ArXiv 2023

Simple and Controllable Music Generation, ArXiv 2023

Audiobox: Unified Audio Generation with Natural Language Prompts, ArXiv 2023

Masked Audio Generation using a Single Non-Autoregressive Transformer, ArXiv 2024


Audio event and scene recognition: A unified approach using strongly and weakly labeled data, IJCNN 2017

Sound Event Detection Using Point-Labeled Data, WASPAA 2019

An Investigation of the Effectiveness of Phase for Audio Classification, ICASSP 2022


| Task | Dataset | Source | Num. Files |
| --- | --- | --- | --- |
| Sound Event Classification | ESC-50 | freesound.org | 2k files |
| Sound Event Classification | DCASE17 Task 4 | YT videos | 2k files |
| Sound Event Classification | US8K | freesound.org | 8k files |
| Sound Event Classification | FSD50K | freesound.org | 50k files |
| Sound Event Classification | AudioSet | YT videos | 2M files |
| COVID-19 Detection using Coughs | DiCOVA | Volunteers recording audio via a website | 1k files |
| Few-shot Bioacoustic Event Detection | DCASE21 Task 5 | audio | 4k+ files |
| Acoustic Scene Classification | DCASE18 Task 1 | Recorded by TUT | 1.5k |
| Various | VGG-Sound | Web videos | 200k files |
| Audio Captioning | Clotho | freesound.org | 5k files |
| Audio Captioning | AudioCaps | YT videos | 51k files |
| Audio-text | SoundDescs | BBC Sound Effects | 32k files |
| Audio-text | WavText5K | Varied | 5k files |
| Audio-text | LAION 630k | Varied | 630k files |
| Audio-text | WavCaps | Varied | 400k files |
| Action Recognition | UCF101 | Web videos | 13k files |
| Unlabeled | YFCC100M | Yahoo videos | 1M files |

Other audio-based datasets to consider
DCASE dataset list


List of old workshops (archived) and on-going workshops/conferences/journals:

| Venue | Link |
| --- | --- |
| Machine Learning for Audio Signal Processing, NIPS 2017 workshop | https://nips.cc/Conferences/2017/Schedule?showEvent=8790 |
| MLSP: Machine Learning for Signal Processing | https://ieeemlsp.cc/ |
| WASPAA: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics | https://www.waspaa.com |
| ICASSP: IEEE International Conference on Acoustics Speech and Signal Processing | https://2021.ieeeicassp.org/ |
| INTERSPEECH | https://www.interspeech2021.org/ |
| IEEE/ACM Transactions on Audio, Speech and Language Processing | https://dl.acm.org/journal/taslp |
| DCASE | http://dcase.community/ |


Computational Analysis of Sound Scenes and Events


