ddlBoJack/Awesome-Speech-Pretraining

Papers, Code and Statistics for Self-Supervised Learning and Pre-Training on Speech.

Table of Contents generated with DocToc

- Awesome-Speech-Pretraining
  - Papers
  - Resources
  - Statistics

Awesome-Speech-Pretraining

Papers, Resources, and Statistics for Self-Supervised Learning and Pre-Training on Speech.

🌟 marks important papers.

Papers

2018

🌟 CPC: Representation Learning with Contrastive Predictive Coding - A Oord et al, arXiv 2018

2019

APC: An Unsupervised Autoregressive Model for Speech Representation Learning - YA Chung et al, INTERSPEECH 2019
🌟 wav2vec: Unsupervised Pre-training for Speech Recognition - S Schneider et al, INTERSPEECH 2019
🌟 vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - A Baevski et al, arXiv 2019, ICLR 2020
MPC: Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - D Jiang et al, arXiv 2019
PASE: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - S Pascual et al, INTERSPEECH 2019

2020

Bidir CPC: Learning robust and multilingual speech representations - K Kawakami et al, EMNLP 2020
Multi-target APC: Improved speech representations with multi-target autoregressive predictive coding - YA Chung et al, ACL 2020
Modified CPC: Unsupervised pretraining transfers well across languages - M Riviere et al, ICASSP 2020
Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - AT Liu et al, ICASSP 2020
vq-wav2vec-FT: Effectiveness of self-supervised pre-training for ASR - A Baevski et al, ICASSP 2020
DeCoAR: Deep contextualized acoustic representations for semi-supervised speech recognition - S Ling et al, ICASSP 2020
Improved noisy student training for automatic speech recognition - DS Park et al, INTERSPEECH 2020
🌟 wav2vec 2.0: A framework for self-supervised learning of speech representations - A Baevski et al, NeurIPS 2020
Multi-lingual wav2vec 2.0: Unsupervised cross-lingual representation learning for speech recognition - A Conneau et al, arXiv 2020
Self-Training wav2vec 2.0: Self-training and Pre-training are Complementary for Speech Recognition - Q Xu et al, arXiv 2020, ICASSP 2021
DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization - S Ling et al, arXiv 2020, ICASSP 2021
Pushing the limits of semi-supervised learning for automatic speech recognition - Y Zhang et al, arXiv 2020, NeurIPS Workshop 2020

2021

UniSpeech: Unified speech representation learning with labeled and unlabeled data - C Wang et al, ACL 2021
TERA: Self-supervised learning of transformer encoder representation for speech - AT Liu et al, TASLP 2021
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - WN Hsu et al, INTERSPEECH 2021
Zero-shot wav2vec 2.0: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - Q Xu et al, arXiv 2021
🌟 wav2vec-U: Unsupervised Speech Recognition - A Baevski et al, NeurIPS 2021
🌟 HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - WN Hsu et al, TASLP 2021
🌟 SUPERB: Speech processing Universal PERformance Benchmark - S Yang et al, INTERSPEECH 2021
Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - G Zheng et al, EMNLP 2021
ILS-SSL: Self-Supervised Learning for speech recognition with Intermediate layer supervision - C Wang et al, arXiv 2021, ICASSP 2022
WavLM: Large-scale self-supervised pre-training for full stack speech processing - S Chen et al, arXiv 2021, JSTSP 2022
BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - Y Zhang et al, arXiv 2021, JSTSP 2022
SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing - J Ao et al, arXiv 2021, ACL 2022

2022

🌟 data2vec: A general framework for self-supervised learning in speech, vision and language - A Baevski et al, ICML 2022
BEST-RQ: Self-supervised Learning with Random-projection Quantizer for Speech Recognition - CC Chiu et al, ICML 2022
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - HS Tsai et al, ACL 2022
🌟 wav2vec-U 2.0: Towards End-to-end Unsupervised Speech Recognition - AH Liu et al, SLT 2022
c-siam: Contrastive Siamese Network for Semi-Supervised Speech Recognition - S Khorram et al, ICASSP 2022
Speech2C: Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - J Ao et al, INTERSPEECH 2022
SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - W Huang et al, ICLR 2022
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - F Wu et al, arXiv 2022, ICASSP 2023
HuBERT-AP: Speech Pre-training with Acoustic Piece - S Ren et al, INTERSPEECH 2022
PBERT: Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - C Wang et al, INTERSPEECH 2022
data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - A Baevski et al, arXiv 2022
CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - C Meng et al, arXiv 2022, INTERSPEECH 2023
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - Z Ma et al, arXiv 2022, INTERSPEECH 2023

2023

CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - R Fan et al, ICASSP 2023
data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - VS Lodagala et al, ICASSP 2023
MonoBERT & PolyBERT: Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - Z Ma et al, INTERSPEECH 2023
MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - JW Yoon et al, INTERSPEECH 2023

Speech + Text

A general multi-task learning framework to leverage text data for speech to text tasks - Y Tang et al, ICASSP 2021
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - A Bapna et al, arXiv 2021
mSLAM: Massively multilingual joint pre-training for speech and text - A Bapna et al, arXiv 2022
Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - W Wang et al, INTERSPEECH 2022
STPT: Unified Speech-Text Pre-training for Speech Translation and Recognition - Y Tang et al, ACL 2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - Y Kang et al, AAAI 2022
Distill-L2S: Distilling a Pretrained Language Model to a Multilingual ASR Model - K Choi et al, INTERSPEECH 2022
SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - Z Zhang et al, EMNLP 2022
TESSP: Text-Enhanced Self-Supervised Speech Pre-training - Z Yao et al, arXiv 2022
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - Z Zhang et al, arXiv 2022
token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - X Yue et al, ICASSP 2023

SSL for Audio

BYOL-A: BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - D Niizumi et al, IJCNN 2021
Audio-MAE: Masked Autoencoders that Listen - PY Huang et al, NeurIPS 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - A Baade et al, INTERSPEECH 2022
BEATs: Audio Pre-Training with Acoustic Tokenizers - S Chen et al, ICML 2023
ATST: Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - X Li et al, arXiv 2023
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - W Chen et al, arXiv 2024

SSL for TTS

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - R Eloff et al, INTERSPEECH 2019
Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - H Zhang et al, INTERSPEECH 2020
Towards Unsupervised Speech Synthesis - AH Liu et al, NAACL 2022

SSL Model Distillation, Compression and Acceleration

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - H Chang et al, ICASSP 2022
FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - Y Lee et al, INTERSPEECH 2022
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - R Wang et al, INTERSPEECH 2022
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - T Ashihara et al, INTERSPEECH 2022
Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - Y Wang et al, arXiv 2022
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - G Yang et al, ASRU 2023

Resources

Speech processing Universal PERformance Benchmark (SUPERB)

Self-Supervised Speech Pre-training and Representation Learning (S3PRL)

Statistics

Training statistics (model size, batch size, and compute) reported for speech pre-training models.

wav2vec 2.0

Pre-training

| Size | Transformer | Samples | Batch Size | Train Time |
| --- | --- | --- | --- | --- |
| BASE | 12 blocks, model dimension 768, FFN 3072, 8 heads | 1.4M (cropped) per GPU | 1.6 h | 400k updates, 64 V100 × 1.6 days |
| LARGE | 24 blocks, model dimension 1024, FFN 4096, 16 heads | 1.2M (cropped) per GPU | 2.7 h | 250k updates, 128 V100 × 2.3 days (Librispeech); 600k updates, 128 V100 × 5.2 days (LibriVox) |
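
To compare the compute budgets in the table above, the train-time entries can be converted to GPU-days. This is a minimal sketch, assuming the listed durations are wall-clock days on the stated number of V100 GPUs. The dictionary name and the helper `gpu_days` are illustrative and do not come from the wav2vec 2.0 paper or code.

```python
# (number of V100 GPUs, wall-clock days), copied from the table above.
WAV2VEC2_PRETRAINING = {
    "BASE": (64, 1.6),
    "LARGE (Librispeech)": (128, 2.3),
    "LARGE (LibriVox)": (128, 5.2),
}


def gpu_days(num_gpus: int, days: float) -> float:
    """Total GPU-days, assuming every GPU runs for the full duration."""
    return num_gpus * days


for name, (num_gpus, days) in WAV2VEC2_PRETRAINING.items():
    print(f"{name}: {gpu_days(num_gpus, days):.1f} GPU-days")
```

Under these assumptions, the LARGE model on LibriVox uses roughly 6.5 times the GPU-days of the BASE model.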

Fine-tuning

wav2vec-U

| Method | Feature Extractor | Batch Size | Train Time |
| --- | --- | --- | --- |
| wav2vec-U | wav2vec 2.0 LARGE | 160 unlabeled audio + 160 text samples | 150k steps, single V100 × 12 h |
| wav2vec-U + self training | wav2vec 2.0 LARGE | / | 80k updates, 8 V100 (Librispeech); 13k updates, 4 V100 (TIMIT) |

HuBERT

Pre-training

| Size | Feature Extractor | Batch Size | Stage | Train Time |
| --- | --- | --- | --- | --- |
| BASE | wav2vec 2.0 BASE (95M) | 87.5 s | 1: MFCC, 250k steps; 2: 6th transformer layer, 400k steps | 9.5 h per 100k steps, 32 GPUs (Librispeech-960) |
| LARGE | wav2vec 2.0 LARGE (317M) | 56.25 s | 3: 9th transformer layer from BASE HuBERT, 400k steps | 9.5 h per 100k steps, 128 GPUs (Libri-light-60k) |
| X-LARGE | Conformer XXL (964M) | 22.5 s | 3: 9th transformer layer from BASE HuBERT, 400k steps | 9.5 h per 100k steps, 256 GPUs (Libri-light-60k) |
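
The HuBERT rows above can be turned into rough GPU-hour estimates by combining the step counts, the 9.5 h per 100k steps rate, and the GPU counts. This is a sketch under the assumption that the rate holds for every stage and model size, as the table lists the same figure in all three rows. `hubert_gpu_hours` and the dictionary are hypothetical names used only for illustration.

```python
HOURS_PER_100K_STEPS = 9.5  # wall-clock rate as listed in the table

# (total steps over the listed stages, number of GPUs), copied from the table.
HUBERT_PRETRAINING = {
    "BASE": (250_000 + 400_000, 32),  # stage 1 + stage 2
    "LARGE": (400_000, 128),          # stage 3
    "X-LARGE": (400_000, 256),        # stage 3
}


def hubert_gpu_hours(steps: int, num_gpus: int,
                     hours_per_100k: float = HOURS_PER_100K_STEPS) -> float:
    """Estimate GPU-hours: wall-clock hours for `steps`, times the GPU count."""
    wall_clock_hours = steps / 100_000 * hours_per_100k
    return wall_clock_hours * num_gpus


for size, (steps, num_gpus) in HUBERT_PRETRAINING.items():
    print(f"{size}: {hubert_gpu_hours(steps, num_gpus):.0f} GPU-hours")
```

These are order-of-magnitude estimates only. They exclude the cost of the earlier stages that LARGE and X-LARGE inherit from BASE HuBERT.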

Fine-tuning