This is the official implementation of our paper "Investigating the Emergent Audio Classification Ability of ASR Foundation Models".
Authors: Rao Ma†, Adian Liusie†, Mark Gales, Kate Knill
Keypoints:
- In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification.
- With simple template-based text prompts (see the example below), we demonstrate that Whisper shows promising zero-shot classification performance on 8 audio classification datasets, outperforming the accuracy of the existing state-of-the-art zero-shot baseline by an average of 9%.
- To unlock this emergent ability, we introduce debiasing approaches: a simple unsupervised reweighting of the class probabilities yields consistent, significant performance gains.
- We also show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.
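To illustrate the template-based prompting, each class label w_k is inserted into a fixed text template to form the prompt t(w_k) that Whisper scores. The template and labels below are examples only; the actual templates compared in the paper are listed in Appendix D:

```python
# Example only: a plausible classification template of the kind compared in
# Appendix D of the paper; the labels are sample ESC-50 classes.
template = "This is a sound of {}."
labels = ["dog", "rain", "chirping birds"]
prompts = [template.format(w) for w in labels]  # t(w_k) for each class w_k
print(prompts)
```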
Files:
- data/: The processed data files, where each line contains the audio_id, the audio_path, and the prompt text filled with a class label. Model performance with different prompt templates is compared in Appendix D of the original paper.
- src/whisper_audio_classification.py: Computes the zero-shot classification probability $P_\theta(t(w_k)|s)$ for each class $w_k$ given the audio $s$, as specified in Section 3.1 (see the sketch after this list).
- src/scoring.py: Calculates the classification accuracy with the different debiasing methods introduced in Section 3.2.
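As a concrete illustration of the Section 3.1 scoring, the sketch below computes $P_\theta(t(w_k)|s)$ with the openai-whisper API by teacher-forcing each filled prompt through Whisper's decoder and summing the token log-probabilities. This is a minimal sketch, not the repository's src/whisper_audio_classification.py; the audio path and prompt strings are placeholders.

```python
import torch
import whisper
from whisper.tokenizer import get_tokenizer

# Minimal sketch of the Section 3.1 scoring (placeholder audio path and prompts).
model = whisper.load_model("large-v2")
tokenizer = get_tokenizer(model.is_multilingual, task="transcribe")

audio = whisper.pad_or_trim(whisper.load_audio("example.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
mel = mel.to(next(model.parameters()).dtype)  # match fp16/fp32 model weights

with torch.no_grad():
    audio_features = model.encoder(mel.unsqueeze(0))  # encode s once, reuse for every class

prompts = ["This is a sound of dog.", "This is a sound of rain."]  # t(w_k)
sot = list(tokenizer.sot_sequence_including_notimestamps)

scores = []
for text in prompts:
    tokens = torch.tensor([sot + tokenizer.encode(" " + text)], device=model.device)
    with torch.no_grad():
        logprobs = model.decoder(tokens, audio_features).log_softmax(dim=-1)
    # log P(t(w_k) | s): each text token scored given its preceding context
    score = sum(logprobs[0, i, tokens[0, i + 1]].item()
                for i in range(len(sot) - 1, tokens.shape[1] - 1))
    scores.append(score)

print("predicted prompt:", prompts[scores.index(max(scores))])
```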
Requirements:
- python 3.10
- torch 1.12.1
- pytorch_lightning 2.0.2
- openai-whisper
In our paper we experimented with 8 audio classification datasets and 1 audio question answering dataset. Please download the original wav data for the test split using the following links, and modify the audio paths in the files under the data/ directory accordingly.
- Sound Event Classification:
- ESC-50: https://github.com/karolpiczak/ESC-50
- UrbanSound8K: https://zenodo.org/records/1203745
- Acoustic Scene Classification:
- TUT2017: https://zenodo.org/records/1040168
- Vocal Sound Classification:
- Vocal sound: https://github.com/YuanGongND/vocalsound
- Emotion Recognition:
- RAVDESS: https://zenodo.org/records/1188976
- CREMA-D: https://github.com/CheyneyComputerScience/CREMA-D
- Music Genre Classification:
- GTZAN
- Speaker Counting:
- LibriCount: https://zenodo.org/records/1216072
- Audio Question Answering:
- Clotho-AQA: https://zenodo.org/records/6473207
Here we use the ESC-50 dataset as an example to illustrate how to elicit the emergent audio classification ability of Whisper:
Run the zero-shot audio classification:
python3 src/whisper_audio_classification.py --eval_list_file data/ESC50/prompt0
Run the zero-shot audio classification with zero input:
python3 src/whisper_audio_classification.py \
--eval_list_file data/ESC50/prompt0 \
--ilm zero --avg_len 5.0
Run the zero-shot audio classification with Gaussian noise:
python3 src/whisper_audio_classification.py \
--eval_list_file data/ESC50/prompt0 \
--ilm gaussian --avg_len 5.0
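For intuition, the --ilm runs above score each prompt against uninformative input (all-zero or Gaussian-noise audio of roughly the dataset's average length) to estimate Whisper's audio-independent "internal LM" bias for each prompt; this bias is then subtracted from the audio-conditioned scores before the argmax. A toy numpy sketch of the subtraction, with random arrays standing in for the real log-probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_logp = rng.normal(size=(4, 3))  # stand-in for log P(t(w_k)|s), (N utterances, K classes)
ilm_logp = rng.normal(size=3)         # stand-in for prompt scores on zero/noise audio, (K,)

# subtract the audio-independent prompt bias, then classify per utterance
preds = (audio_logp - ilm_logp[None, :]).argmax(axis=1)
print(preds)
```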
Calculate the classification accuracy with different calibration methods:
python3 src/scoring.py exp/large-v2/ESC50/prompt0
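For intuition, prior matching reweights the class probabilities so that the average prediction over the unlabelled test set is close to uniform, requiring no labels at all. Below is a minimal sketch of one such unsupervised reweighting via iterative scaling; src/scoring.py may implement the calibration differently.

```python
import numpy as np

def prior_match(logp: np.ndarray, iters: int = 100) -> np.ndarray:
    """logp: (N, K) uncalibrated class log-probabilities; returns reweighted probabilities."""
    k = logp.shape[1]
    w = np.zeros(k)  # per-class log-weights
    for _ in range(iters):
        z = logp + w
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        w -= np.log(p.mean(axis=0) * k)  # push the induced class prior toward uniform
    return p

rng = np.random.default_rng(0)
probs = prior_match(rng.normal(size=(8, 4)))
print(probs.mean(axis=0))  # roughly uniform (~0.25 each)
```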
We list the classification accuracy of Whisper large-v2 obtained with prompt0; please refer to our paper for more detailed results.
| Model | ESC50 | US8K | TUT | Vocal | RAVDESS | CREMA-D | GTZAN | LibriCount | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Random | 2.0 | 10.0 | 6.7 | 16.7 | 12.5 | 16.7 | 10.0 | 9.1 | 10.4 |
| AudioCLIP | 69.4 | 65.3 | - | - | - | - | - | - | - |
| CLAP | 82.6 | 73.2 | 29.6 | 49.4 | 16.0 | 17.8 | 25.2 | 17.9 | 39.0 |
| Ours, Uncalibrated | 38.9 | 50.5 | 7.7 | 60.1 | 15.1 | 20.2 | 38.2 | 9.2 | 30.0 |
| Ours, Zero-input | 35.9 | 52.1 | 18.0 | 57.5 | 29.4 | 26.5 | 45.8 | 13.6 | 34.9 |
| Ours, Prior-matched | 65.4 | 60.4 | 26.0 | 84.9 | 41.7 | 28.8 | 60.9 | 17.3 | 48.2 |
Citation:
@article{ma2023investigating,
title={Investigating the Emergent Audio Classification Ability of ASR Foundation Models},
author={Ma, Rao and Liusie, Adian and Gales, Mark JF and Knill, Kate M},
journal={arXiv preprint arXiv:2311.09363},
year={2023}
}