- Introduction
- Citing
- Getting Started
- SSAST Model
- Data Preparation
- Self-Supervised Pretraining
- Fine-tuning On Downstream Tasks
- Pretrained Models
- Contact
This repository adapts SSAST for analysis in the CIAB project. Please find the main repo here
The repository features an adapted copy of the Self-Supervised Audio Spectrogram Transformer (SSAST) proposed in the AAAI 2022 paper SSAST: Self-Supervised Audio Spectrogram Transformer (Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass; MIT CSAIL).
SSAST is the first patch-based joint discriminative and generative self-supervised learning framework, and also the first self-supervised learning framework for AST. SSAST significantly boosts AST performance on all downstream tasks we evaluated, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. SSAST can be used as a drop-in replacement for the previous ImageNet (supervised) pretrained AST, with the advantages that 1) no labeled data is used; 2) patch size and shape are flexible, whereas ImageNet pretraining only supports square patches; and 3) performance is better on many tasks, in particular speech tasks.
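The core pretraining idea is to split the input spectrogram into small patches and ask the model to reconstruct and discriminate a set of randomly masked ones. The sketch below illustrates only the patch-splitting and random-masking geometry with NumPy; the real SSAST masks token embeddings inside the transformer, not the raw spectrogram, so treat this as a simplified illustration rather than the actual pipeline.

```python
import numpy as np

def mask_random_patches(spec, fshape=16, tshape=16, num_masked=400, seed=0):
    """Split a (freq, time) spectrogram into fshape x tshape patches and
    zero out `num_masked` randomly chosen patches. Simplified sketch of
    SSAST-style masking: the real model masks patch *embeddings*, and
    the masked patches become reconstruction/classification targets."""
    rng = np.random.default_rng(seed)
    f_bins, t_frames = spec.shape
    nf, nt = f_bins // fshape, t_frames // tshape  # patch grid
    n_patches = nf * nt
    k = min(num_masked, n_patches)
    masked_ids = rng.choice(n_patches, size=k, replace=False)
    out = spec.copy()
    for idx in masked_ids:
        fi, ti = divmod(int(idx), nt)
        out[fi * fshape:(fi + 1) * fshape, ti * tshape:(ti + 1) * tshape] = 0.0
    return out, masked_ids

# 128 mel bins x 1024 frames (roughly 10 s of audio at a 10 ms hop)
spec = np.random.randn(128, 1024)
masked, ids = mask_random_patches(spec)
print(masked.shape, len(ids))  # (128, 1024) 400
```

With the default 16 x 16 patches a 128 x 1024 spectrogram yields 8 x 64 = 512 patches, so masking 400 of them (as in the `-400` models below) hides most of the input, which is what makes the pretraining task hard enough to be useful.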
Please cite the paper if you find this repository useful.
@article{gong2021ssast,
title={SSAST: Self-Supervised Audio Spectrogram Transformer},
author={Gong, Yuan and Lai, Cheng-I Jeff and Chung, Yu-An and Glass, James},
journal={arXiv preprint arXiv:2110.09784},
year={2021}
}
Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.
cd ast/
python3 -m venv venvssast
source venvssast/bin/activate
pip install -r requirements.txt
First `cd` to `/src/finetune/ciab/`, then extract the CIAB features:
python prep_ciab.py
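SSAST-family models typically consume log filterbank spectrograms (e.g. 128 frequency bins over time). As a rough, self-contained illustration of what a feature-extraction step produces, here is a minimal framed-STFT log-spectrogram in NumPy; the actual `prep_ciab.py` pipeline may differ (for example, Kaldi-style filterbanks via torchaudio), so the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def log_spec_features(wave, n_fft=400, hop=160, n_bins=128):
    """Rough stand-in for filterbank extraction: windowed, framed STFT
    magnitudes, log-compressed and truncated to n_bins frequency rows.
    (Illustrative only; prep_ciab.py's real pipeline may differ.)"""
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] for i in range(n_frames)])
    window = np.hanning(n_fft)
    mags = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(mags[:, :n_bins] + 1e-6).T  # (freq, time)

wave = np.random.randn(16000)  # 1 s of synthetic audio at 16 kHz
feats = log_spec_features(wave)
print(feats.shape)  # (128, 98)
```

The resulting (frequency, time) matrix is the kind of input the patch shapes in the tables below are defined over.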
Once this has finished, fine-tune a model on the task of COVID-19 detection from audio by running:
sh run_ciab.sh
We provide the following self-supervised pretrained models. All models are trained with full AudioSet + Librispeech. Click a model name to download it. The Tiny model should be able to pretrain and fine-tune on an 8GB GPU with a reasonable batch size.
For best performance, use one of the following two models: the patch-based AST is better for audio tasks, while the frame-based AST is better for speech tasks.
Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
---|---|---|---|---|---|---|---|
SSAST-Base-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Base (89M) | 59.9 | 79.5 |
SSAST-Base-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Base (89M) | 57.6 | 84.0 |
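The `fshape`/`tshape` columns fix the patch geometry, and with it the number of transformer tokens. A quick arithmetic check, assuming non-overlapping patches on a 128 x 1024 spectrogram (128 mel bins, 1024 frames, roughly 10 s of audio):

```python
# Token counts for the two pretraining shapes above, assuming
# non-overlapping patches on a 128 (freq) x 1024 (time) spectrogram.
f_bins, t_frames = 128, 1024

patch_tokens = (f_bins // 16) * (t_frames // 16)   # patch-based: 16 x 16
frame_tokens = (f_bins // 128) * (t_frames // 2)   # frame-based: 128 x 2

print(patch_tokens, frame_tokens)  # 512 512
```

Both shapes produce 512 tokens for this input size; the difference is that the 128 x 2 "frame" patch spans the whole frequency axis, which preserves temporal ordering and tends to suit speech tasks, as the table above reflects.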
The following models do not have the best performance; we release them for analysis purposes and for low-resource devices.
Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
---|---|---|---|---|---|---|---|
SSAST-Base-Patch-250 | AudioSet + Librispeech | 16 | 16 | 250 | Base (89M) | 58.6 | 79.5 |
SSAST-Base-Frame-250 | AudioSet + Librispeech | 128 | 2 | 250 | Base (89M) | 55.6 | 81.6 |
SSAST-Small-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Small (23M) | 58.1 | 78.2 |
SSAST-Tiny-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Tiny (6M) | 53.3 | 75.7 |
SSAST-Tiny-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Tiny (6M) | 47.8 | untested |
The links above are Dropbox direct download links (i.e., wget works). For those who don't have access to Dropbox, use a VPN or the OneDrive links.
If you have a question, please open an issue (preferred) or send me an email: harry.coppock@imperial.ac.uk