- Introduction
- Citing
- Getting Started
- SSAST Model
- Data Preparation
- Self-Supervised Pretraining
- Fine-tuning On Downstream Tasks
- Pretrained Models
- Contact
This repository adapts SSAST for analysis in the CIAB project. Please find the main repo here
The repository features an adapted copy of the Self-Supervised Audio Spectrogram Transformer (SSAST) proposed in the AAAI 2022 paper SSAST: Self-Supervised Audio Spectrogram Transformer (Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass; MIT CSAIL).
SSAST is the first patch-based joint discriminative and generative self-supervised learning framework, and also the first self-supervised learning framework for AST. SSAST significantly boosts AST performance on all downstream tasks we evaluated, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. SSAST can be used as a drop-in replacement for the previous ImageNet (supervised) pretrained AST, with the advantages that 1) no labeled data is used; 2) patch size and shape are flexible, whereas ImageNet pretraining only supports square patches; and 3) performance is better on many tasks, in particular speech tasks.
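The core pretraining idea is to split the input spectrogram into small patches and ask the model to reconstruct and discriminate a set of randomly masked ones. The sketch below illustrates only the patch-splitting and random-masking geometry with NumPy; the real SSAST masks token embeddings inside the transformer, not the raw spectrogram, so treat this as a simplified illustration rather than the actual pipeline.

```python
import numpy as np

def mask_random_patches(spec, fshape=16, tshape=16, num_masked=400, seed=0):
    """Split a (freq, time) spectrogram into fshape x tshape patches and
    zero out `num_masked` randomly chosen patches. Simplified sketch of
    SSAST-style masking: the real model masks patch *embeddings*, and
    the masked patches become reconstruction/classification targets."""
    rng = np.random.default_rng(seed)
    f_bins, t_frames = spec.shape
    nf, nt = f_bins // fshape, t_frames // tshape  # patch grid
    n_patches = nf * nt
    k = min(num_masked, n_patches)
    masked_ids = rng.choice(n_patches, size=k, replace=False)
    out = spec.copy()
    for idx in masked_ids:
        fi, ti = divmod(int(idx), nt)
        out[fi * fshape:(fi + 1) * fshape, ti * tshape:(ti + 1) * tshape] = 0.0
    return out, masked_ids

# 128 mel bins x 1024 frames (roughly 10 s of audio at a 10 ms hop)
spec = np.random.randn(128, 1024)
masked, ids = mask_random_patches(spec)
print(masked.shape, len(ids))  # (128, 1024) 400
```

With the default 16 x 16 patches a 128 x 1024 spectrogram yields 8 x 64 = 512 patches, so masking 400 of them (as in the `-400` models below) hides most of the input, which is what makes the pretraining task hard enough to be useful.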
Please cite the paper if you find this repository useful.
@article{gong2021ssast,
title={SSAST: Self-Supervised Audio Spectrogram Transformer},
author={Gong, Yuan and Lai, Cheng-I Jeff and Chung, Yu-An and Glass, James},
journal={arXiv preprint arXiv:2110.09784},
year={2021}
}
Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.
cd ast/
python3 -m venv venvssast
source venvssast/bin/activate
pip install -r requirements.txt
First `cd` to `/src/finetune/ciab/`, then extract the CIAB features:
python prep_ciab.py
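SSAST-family models typically consume log filterbank spectrograms (e.g. 128 frequency bins over time). As a rough, self-contained illustration of what a feature-extraction step produces, here is a minimal framed-STFT log-spectrogram in NumPy; the actual `prep_ciab.py` pipeline may differ (for example, Kaldi-style filterbanks via torchaudio), so the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def log_spec_features(wave, n_fft=400, hop=160, n_bins=128):
    """Rough stand-in for filterbank extraction: windowed, framed STFT
    magnitudes, log-compressed and truncated to n_bins frequency rows.
    (Illustrative only; prep_ciab.py's real pipeline may differ.)"""
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] for i in range(n_frames)])
    window = np.hanning(n_fft)
    mags = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(mags[:, :n_bins] + 1e-6).T  # (freq, time)

wave = np.random.randn(16000)  # 1 s of synthetic audio at 16 kHz
feats = log_spec_features(wave)
print(feats.shape)  # (128, 98)
```

The resulting (frequency, time) matrix is the kind of input the patch shapes in the tables below are defined over.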
Once this has finished, fine-tune a model on the task of COVID-19 detection from audio by running:
sh run_ciab.sh
We provide the following self-supervised pretrained models. All models are trained with full AudioSet + Librispeech. Click a model name to download it. The Tiny model should be able to pretrain and fine-tune on an 8GB GPU with a reasonable batch size.
For best performance, use one of the following two models: the patch-based AST is better for audio tasks, while the frame-based AST is better for speech tasks.
Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
---|---|---|---|---|---|---|---|
SSAST-Base-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Base (89M) | 59.9 | 79.5 |
SSAST-Base-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Base (89M) | 57.6 | 84.0 |
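The `fshape`/`tshape` columns fix the patch geometry, and with it the number of transformer tokens. A quick arithmetic check, assuming non-overlapping patches on a 128 x 1024 spectrogram (128 mel bins, 1024 frames, roughly 10 s of audio):

```python
# Token counts for the two pretraining shapes above, assuming
# non-overlapping patches on a 128 (freq) x 1024 (time) spectrogram.
f_bins, t_frames = 128, 1024

patch_tokens = (f_bins // 16) * (t_frames // 16)   # patch-based: 16 x 16
frame_tokens = (f_bins // 128) * (t_frames // 2)   # frame-based: 128 x 2

print(patch_tokens, frame_tokens)  # 512 512
```

Both shapes produce 512 tokens for this input size; the difference is that the 128 x 2 "frame" patch spans the whole frequency axis, which preserves temporal ordering and tends to suit speech tasks, as the table above reflects.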
The following models do not have the best performance; we release them for analysis purposes and for low-resource devices.
Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
---|---|---|---|---|---|---|---|
SSAST-Base-Patch-250 | AudioSet + Librispeech | 16 | 16 | 250 | Base (89M) | 58.6 | 79.5 |
SSAST-Base-Frame-250 | AudioSet + Librispeech | 128 | 2 | 250 | Base (89M) | 55.6 | 81.6 |
SSAST-Small-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Small (23M) | 58.1 | 78.2 |
SSAST-Tiny-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Tiny (6M) | 53.3 | 75.7 |
SSAST-Tiny-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Tiny (6M) | 47.8 | untested |
The links above are Dropbox direct download links (i.e., wget works). For those who don't have access to Dropbox, use a VPN or the OneDrive links.
If you have a question, please open an issue (preferred) or send me an email: harry.coppock@imperial.ac.uk