
SSAST: Self-Supervised Audio Spectrogram Transformer

Introduction

SSAST has been adapted here for analysis in the CIAB project. Please find the main upstream repo here.

[Figure: Illustration of AST.]

The repository features an adapted copy of the Self-Supervised Audio Spectrogram Transformer (SSAST) proposed in the AAAI 2022 paper SSAST: Self-Supervised Audio Spectrogram Transformer (Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass; MIT CSAIL).

SSAST is the first patch-based joint discriminative and generative self-supervised learning framework, and also the first self-supervised learning framework for AST. SSAST significantly boosts AST performance on all downstream tasks we evaluated, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. SSAST can be used as a drop-in replacement for the previous ImageNet (supervised) pretrained AST, and has the advantages that 1) no labeled data is used; 2) the patch size and shape are flexible, whereas ImageNet pretraining only supports square patches; and 3) performance is better on many tasks, in particular speech tasks.
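
To make the patch-shape flexibility concrete, the sketch below counts how many non-overlapping patches each configuration extracts from a 128-mel-bin by 1024-frame spectrogram (the input size used in the pretrained models listed below); the helper function is purely illustrative and not part of this repo.

# Illustrative only: count non-overlapping patches for a given patch shape.
def num_patches(fdim=128, tdim=1024, fshape=16, tshape=16):
    """Number of (fshape x tshape) patches tiling an fdim x tdim spectrogram."""
    return (fdim // fshape) * (tdim // tshape)

print(num_patches(fshape=16, tshape=16))   # patch-based AST: 8 * 64  = 512 patches
print(num_patches(fshape=128, tshape=2))   # frame-based AST: 1 * 512 = 512 "frames"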

Citing

Please cite the paper if you find this repository useful.

@article{gong2021ssast,
  title={SSAST: Self-Supervised Audio Spectrogram Transformer},
  author={Gong, Yuan and Lai, Cheng-I Jeff and Chung, Yu-An and Glass, James},
  journal={arXiv preprint arXiv:2110.09784},
  year={2021}
}

Getting Started

Prepare the Environment

Clone or download this repository and set it as the working directory, then create a virtual environment and install the dependencies.

cd ssast_ciab/ 
python3 -m venv venvssast
source venvssast/bin/activate
pip install -r requirements.txt 
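
As a quick sanity check that the environment is usable (assuming the requirements include torch and torchaudio, as in the upstream SSAST), you can run:

# Quick environment sanity check; assumes torch and torchaudio are in requirements.txt.
import torch
import torchaudio

print(torch.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())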

Replicate the CIAB analysis

First, cd to src/finetune/ciab/ and then extract the CIAB features:

python prep_ciab.py
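
For reference, the upstream SSAST pipeline represents audio as Kaldi-style log-mel filterbank features, so prep_ciab.py is expected to perform extraction along these lines. This is a minimal sketch: the file name ciab_sample.wav and the exact parameters are illustrative assumptions, not taken from this repo.

# Illustrative sketch of SSAST-style feature extraction; parameters are assumptions.
import torchaudio

waveform, sr = torchaudio.load('ciab_sample.wav')  # hypothetical input file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
print(fbank.shape)  # (num_frames, 128) log-mel filterbank features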

When this has run, to finetune a model on the task of COVID-19 detection from audio run:

sh run_ciab.sh
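
Under the hood, fine-tuning loads one of the self-supervised checkpoints listed below into the AST model, with the classifier head sized for the downstream task. A minimal sketch, following the usage shown in the upstream SSAST repo; the checkpoint path, the binary label dimension, and the stride values here are illustrative, not the exact settings of run_ciab.sh.

# Minimal sketch following upstream SSAST usage; paths and label_dim are illustrative.
import torch
from models import ASTModel  # upstream SSAST model definition

model = ASTModel(
    label_dim=2,                  # e.g. binary COVID-19 detection
    fshape=16, tshape=16,         # must match the pretrained patch shape
    fstride=10, tstride=10,       # overlapping patches for fine-tuning
    input_fdim=128, input_tdim=1024,
    model_size='base',
    pretrain_stage=False,         # fine-tuning, not self-supervised pretraining
    load_pretrained_mdl_path='./SSAST-Base-Patch-400.pth')

spec = torch.zeros(1, 1024, 128)         # (batch, time, frequency) input spectrogram
logits = model(spec, task='ft_avgtok')   # average-token fine-tuning head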

Pretrained-Models

We provide the following self-supervised pretrained models. All models are trained with the full AudioSet + Librispeech. Click the model name to download. The Tiny model should be able to pretrain and fine-tune on an 8GB GPU with a reasonable batch size.

For best performance, you should use one of the following two models: the patch-based AST is better for audio tasks, while the frame-based AST is better for speech tasks.

| Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SSAST-Base-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Base (89M) | 59.9 | 79.5 |
| SSAST-Base-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Base (89M) | 57.6 | 84.0 |

The following models do not achieve the best performance; we release them for analysis purposes and for low-resource devices.

| Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SSAST-Base-Patch-250 | AudioSet + Librispeech | 16 | 16 | 250 | Base (89M) | 58.6 | 79.5 |
| SSAST-Base-Frame-250 | AudioSet + Librispeech | 128 | 2 | 250 | Base (89M) | 55.6 | 81.6 |
| SSAST-Small-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Small (23M) | 58.1 | 78.2 |
| SSAST-Tiny-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Tiny (6M) | 53.3 | 75.7 |
| SSAST-Tiny-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Tiny (6M) | 47.8 | untested |

The above links are Dropbox direct download links (i.e., wget works). For those who don't have access to Dropbox, use a VPN or use the OneDrive links.

Contact

If you have a question, please open an issue (preferred) or send me an email at harry.coppock@imperial.ac.uk.