The Self-Supervised Speech Pre-training and Representation Learning Toolkit 🦜, built on PyTorch, for developing self-supervised learning upstream models and evaluating them on a wide variety of downstream tasks.
- Table of contents
- Introduction
- Installation
- Data preparation
- Train upstream models
- Downstream evaluations
- Evaluating your own model
- Using upstream models with your own task
- Tutorial for application on custom dataset
- Supplementary Wiki Page
- Development pattern for contributors
- Reference
- Citation
This is an open source project called S3PRL, which stands for Self-Supervised Speech Pre-training and Representation Learning. In this toolkit, various upstream self-supervised speech models are implemented with easy-to-load setups, and downstream evaluation tasks are available with easy-to-use scripts.
- Mockingjay
- Described in "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders"
- Transformer based, BERT-style masked reconstruction loss
- These papers used our implementations: Adversarial Defense, Understanding Self-attention
- Accepted by ICASSP 2020 as an oral lecture.
- TERA
- Described in "TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech"
- Transformer based, multi-target alteration reconstruction loss
- Submitted to IEEE/ACM TASLP.
- Audio ALBERT
- Described in "Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation"
- Transformer based, BERT-style masked reconstruction loss
- Submitted to INTERSPEECH 2020.
- APC
- Described in "An Unsupervised Autoregressive Model for Speech Representation Learning"
- RNN based, unidirectional reconstruction loss
- Accepted by INTERSPEECH 2019 as an oral session.
- Phone classification:
- Speaker recognition:
- ASR speech recognition:
- Hybrid DNN/HMM speech recognition systems with the PyTorch-Kaldi Toolkit
- We provide pre-trained models (as the DNN part of hybrid DNN/HMM) with initializers that are PyTorch-Kaldi ready.
- Sentiment classification on spoken content:
- simple one-layer RNN classifier on MOSEI dataset
- Proposed and used in Mockingjay.
- Acoustic feature extraction scripts:
- LibriSpeech and TIMIT:
- WSJ: coming soon
- Extracted features can be downloaded directly from: S3PRL Drive
- On-the-fly feature extraction using torchaudio as backend
- see section: Data preparation
- Pre-train your own self-supervised models:
- Implementation of various upstream algorithms.
- Pre-train them on your own data.
- Supporting various optimizers including: BERT Adam, AdamW, LAMB
- see section: Train upstream models
- Evaluate your own pre-trained model:
- Easy-to-use downstream evaluation scripts.
- Incorporate any pre-trained model of your own.
- see section: Evaluating your own model
- Apply pre-trained models on your own task:
- Easy-to-use pre-trained model initialization.
- Incorporate any downstream task with the provided pre-trained models.
- Implemented as PyTorch-Kaldi ready DNNs.
- Pre-trained checkpoints can be downloaded directly from: S3PRL Drive
- see section: Using upstream models with your own task
- Knowledge transfer of pre-trained model to downstream task:
- We support various methods of incorporating the pre-trained model with downstream models:
- Extracting from the last layer
- Learnable weighted sum extraction from all layers (similar to ELMo)
- Fine-tuning
- See section: Apply different knowledge transfer methods
Feel free to use or modify them. Any bug report or improvement suggestion will be appreciated. If you have any questions, please contact tingweiandyliu@gmail.com. If you find this project helpful for your research, please consider citing our papers. Thanks!
- Python 3 or above
- PyTorch 1.3.0 or above
- Computing power (a high-end GPU) and memory (both system RAM and GPU RAM) are extremely important if you'd like to train your own model.
- Required packages and their use are listed below, and also in requirements.txt:
joblib # parallel feature extraction & decoding
librosa # feature extraction
scipy # feature extraction
tqdm # verbosity
yaml # config parser
numpy # array computation
pandas # data management
tensorboardX # logger & monitor
torch # model & learning
matplotlib # visualization
Pillow # visualization
The above packages can be installed by the command: pip install -r requirements.txt
- Here we list optional packages that need special attention. We recommend installing them manually:
ipdb # debugger (Optional)
apex # faster optimization (Optional and non-essential, only needed if enabled in config)
pydub # audio segmentation (Optional, for MOSEI dataset preprocessing only)
Kaldi # feature extraction (Optional, if you want to extract features by yourself)
PyTorch-Kaldi # for hybrid ASR training (Optional)
For the installation and usage of Kaldi and PyTorch-Kaldi, see our supplementary wiki page: Extracting with Kaldi and ASR with PyTorch-Kaldi
- Clone this repo:
git clone https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning.git
- If you have any importing errors, try the following.
- Also, to use the codes in this repo from another project (e.g. PyTorch-Kaldi), you have to set a global path.
- Open the file ~/.bashrc in your text editor, e.g. subl ~/.bashrc
- Add the following line to the end:
export PYTHONPATH="/your_abs_path/Self-Supervised-Speech-Pretraining-and-Representation-Learning:$PYTHONPATH"
Make sure you change it to your own path.
- Restart your terminal application to read in the new settings, and type this to check if everything is working:
echo $PYTHONPATH
- Now in any python environment or .py file, we can do the following in any directory:
from transformer.nn_transformer import TRANSFORMER
- Read the documentation if you run into any problem.
- For Windows, add the following lines to your .py code:
import sys
# set this to your own path
S3PRL_PATH = "C:\\Users\\ANDYLIU\\Self-Supervised-Speech-Pretraining-and-Representation-Learning"
if S3PRL_PATH not in sys.path:
    sys.path.append(S3PRL_PATH)
- We provide the features we extracted for you to download directly: S3PRL Drive
Structure of S3PRL Drive:
data/
libri_mfcc_cmvn.zip
libri_fbank_cmvn.zip
libri_fmllr_cmvn.zip # features used for TERA
timit_fmllr_cmvn.zip
libri_mel160_subword5000.zip # features used for Mockingjay
- Download then unzip them, for example:
cd data/
unzip libri_fmllr_cmvn.zip
- Modify the setting in config files: config/downstream.yaml, and others if needed:
data_path: 'data/libri_fmllr_cmvn'
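The download-and-unzip step above can also be done from Python, which is handy on systems without the unzip command. This is a minimal sketch using only the standard library; extract_features is a hypothetical helper (not part of the toolkit), and it assumes the archive unpacks into a top-level directory whose path you then use as data_path:

```python
import zipfile
from pathlib import Path


def extract_features(zip_path, data_dir="data"):
    """Unzip a downloaded feature archive into data_dir.

    Returns the top-level paths created by the archive, e.g. the directory
    that data_path in the config should point to.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
        # Collect the distinct first path components of all archive members
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return [data_dir / name for name in top_level]


if __name__ == "__main__":
    # Assumes the archive was downloaded to data/libri_fmllr_cmvn.zip
    for path in extract_features("data/libri_fmllr_cmvn.zip"):
        print("use as data_path:", path)
```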
- Download the LibriSpeech dataset and place it under data/: data/LibriSpeech.
- The extracted data, which is ready for training, will be stored under the same data/ directory by default.
# To preprocess different acoustic features, options are:
python preprocess/preprocess_libri.py --feature_type=mfcc --delta=True --delta_delta=True # this generates: /data/libri_mfcc39, window_size=25ms, stride=10ms
python preprocess/preprocess_libri.py --feature_type=fbank --delta=False # this generates: /data/libri_fbank80, window_size=25ms, stride=10ms
python preprocess/preprocess_libri.py --feature_type=fbank --delta=True # this generates: /data/libri_fbank160, window_size=25ms, stride=10ms
# features used for old Mockingjay pre-trained models (also for the Montreal phone set)
python preprocess/preprocess_libri.py --feature_type=linear --delta=False # 1025-dim, window_size=50ms, stride=12.5ms
python preprocess/preprocess_libri.py --feature_type=mel --delta=True # 160-dim, window_size=50ms, stride=12.5ms
python preprocess/preprocess_timit.py --feature_type=fbank --delta=False # 80-dim, window_size=25ms, stride=10ms
python preprocess/preprocess_timit.py --feature_type=mfcc --delta=True --delta_delta=True # 39-dim, window_size=25ms, stride=10ms
# old preprocessing settings:
python preprocess/preprocess_timit.py --feature_type=mel --data_path=../data/LibriSpeech # 160-dim, window_size=50ms, stride=12.5ms
python preprocess/preprocess_timit.py --feature_type=linear --delta=False # 1025-dim, window_size=50ms, stride=12.5ms
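The dimensions quoted in the comments above follow from simple arithmetic: each enabled delta stream adds one more copy of the base feature dimension. Below is a minimal sketch, assuming base dimensions of 13 for mfcc and 80 for fbank/mel (inferred from the 39-, 80- and 160-dim comments, not read from the scripts). feature_dim is a hypothetical helper, not part of the toolkit:

```python
def feature_dim(base_dim, delta=False, delta_delta=False):
    """Final feature dimension after optionally appending delta streams.

    base_dim: dimension of the static features (assumed 13 for mfcc, 80 for fbank/mel).
    """
    streams = 1 + int(delta) + int(delta_delta)  # static + optional delta + optional delta-delta
    return base_dim * streams


if __name__ == "__main__":
    print(feature_dim(13, delta=True, delta_delta=True))  # mfcc with both deltas
    print(feature_dim(80))                                # fbank, static only
    print(feature_dim(80, delta=True))                    # fbank or mel with delta
```

This is useful as a quick check that the input dimension expected by a pre-trained checkpoint matches the preprocessing options you picked.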
- To extract with Kaldi, see the supplementary wiki page for detailed instructions: Extracting with Kaldi
- Example code is provided for converting Kaldi .ark files to .npy, a format that a regular PyTorch dataset can load.
- TIMIT: preprocess/ark2timit.py
- LibriSpeech: preprocess/ark2libri.py
- VoxCeleb: preprocess/ark2voxceleb.py
- Or download the extracted features from here: S3PRL Drive
- Place the downloaded *.zip files under data/:
cd data/
unzip libri_fmllr_cmvn.zip # features used for TERA
- This feature allows users to run training and testing without preprocessing data; feature extraction is done at runtime (this will not increase your training time!).
- To enable bucketing (optional, but it substantially increases training efficiency), run this script to collect the lengths of all training utterances:
python preprocess/generate_len_for_bucket.py --data_root=data/LibriSpeech/ # this generates: /data/len_for_bucket
Next, change the following attribute in your config/upstream.yaml and config/downstream.yaml:
dataloader:
  data_path: '/data/len_for_bucket'
- Finally, add the following argument when running upstream/downstream scripts (pre-trained checkpoints automatically use the online.yaml saved during pre-training, so there is no need to specify it for them):
--online_config=config/online.yaml
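To see why bucketing needs the utterance lengths, and why it helps, here is a minimal sketch of length-based bucketing. It is not the toolkit's dataloader: bucket_by_length and padded_frames are hypothetical helpers, and the lengths are made-up frame counts. The idea is that grouping utterances of similar length means less padding per batch:

```python
def bucket_by_length(lengths, batch_size):
    """Group utterance indices into batches of similar length.

    lengths: list of sequence lengths (e.g. frames per utterance).
    Returns a list of batches, each a list of indices into `lengths`.
    """
    # Sort indices from longest to shortest so neighbours have similar lengths
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_frames(lengths, batches):
    """Total frames processed when each batch is padded to its longest utterance."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)


if __name__ == "__main__":
    lengths = [100, 900, 120, 880]            # hypothetical frame counts
    in_order = [[0, 1], [2, 3]]               # batches in file order
    bucketed = bucket_by_length(lengths, batch_size=2)
    print(padded_frames(lengths, in_order))   # 3560 frames including padding
    print(padded_frames(lengths, bucketed))   # 2040 frames including padding
```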
- 41 phone classes, this set is considered in the CPC, TERA papers.
- To use the CPC phone alignment data, use the following command:
cd data/cpc_phone
unzip converted_aligned_phones.zip
- Make sure that in config/downstream.yaml, phone path is set to:
phone_path: 'data/cpc_phone'
- IMPORTANT: these phone alignments provide one label for every 10 ms, so you need features extracted with a 25 ms window and a 10 ms stride. We recommend the Kaldi extracted features.
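Whether a feature configuration lines up with one label per 10 ms can be checked with a little arithmetic. This is a minimal sketch, assuming 16 kHz audio and that only full windows are kept (no padding at the edges); actual extractors may pad or trim differently, and num_frames is a hypothetical helper:

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, stride_ms=10):
    """Number of frames produced when only full analysis windows are kept."""
    window = int(sample_rate * window_ms / 1000)  # window length in samples
    stride = int(sample_rate * stride_ms / 1000)  # hop between windows in samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


if __name__ == "__main__":
    one_second = 16000
    # 25 ms window, 10 ms stride: close to the 100 labels per second of the alignments
    print(num_frames(one_second))                                # 98
    # 50 ms window, 12.5 ms stride: a different frame rate, so labels will not line up
    print(num_frames(one_second, window_ms=50, stride_ms=12.5))  # 77
```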
- 72 phone classes, this set is considered in the Mockingjay paper.
- To use the Montreal Forced Aligner phone alignment data, download libri_alignment.zip from S3PRL Drive and place it under the data/ directory:
cd data
unzip libri_alignment.zip
cd ..
python preprocess/preprocess_alignment.py
- Change the setting in config/downstream.yaml:
phone_path: 'data/libri_phone'
- Warning: you need to use preprocess/preprocess_libri.py --feature_type=mel to extract matching features.
- For the pre-training of each model, we provide default config files *.yaml under the config/ directory. However, you may change them according to your needs.
- Warning: the parameters may not strictly follow the original papers; please verify them carefully if you need them to be identical.
- The argument --name is used for distinction only; you can use whatever name you want.
# Mockingjay BASE, 360 hr
python run_upstream.py --run=transformer --config=config/mockingjay_libri_fbankBase.yaml --name=mockingjay_fbankBase
# Mockingjay LARGE, 360 hr
python run_upstream.py --run=transformer --config=config/mockingjay_libri_fbankLarge.yaml --name=mockingjay_fbankLarge
# TERA-Base: time + channel + mag, 960 hr
python run_upstream.py --run=transformer --config=config/tera_libri_fmllrBase.yaml --name=tera_fmllrBase
# TERA-Medium: time + channel + mag, 960 hr
python run_upstream.py --run=transformer --config=config/tera_libri_fmllrMedium.yaml --name=tera_fmllrMedium
# TERA-Large: time + channel + mag, 960 hr
python run_upstream.py --run=transformer --config=config/tera_libri_fmllrLarge.yaml --name=tera_fmllrLarge
# AALBERT-3L, 100 hr
python run_upstream.py --run=transformer --config=config/aalbert_libri_fbank3L.yaml --name=aalbert_fbank3L
# AALBERT-6L, 360 hr
python run_upstream.py --run=transformer --config=config/aalbert_libri_fbank6L.yaml --name=aalbert_fbank6L
python run_upstream.py --run=apc
- The commands below evaluate the transformer models, where we specify --upstream=transformer.
- The type of pre-trained transformer (Mockingjay, AALBERT, TERA) is decided by the pre-trained checkpoint: --ckpt.
# **Phone Linear** Frame-wise Classification on LibriSpeech
python run_downstream.py --run=phone_linear --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
# **Phone 1 Hidden** Frame-wise Classification on LibriSpeech
python run_downstream.py --run=phone_1hidden --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
# **Phone Concat** Frame-wise Classification on LibriSpeech
python run_downstream.py --run=phone_concat --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
# **Speaker Frame**-wise Classification on LibriSpeech
python run_downstream.py --run=speaker_frame --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
# **Speaker Utterance**-wise Classification on LibriSpeech
python run_downstream.py --run=speaker_utterance --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
- Simply add --weighted_sum to the above commands.
- For example, phone linear frame-wise classification on LibriSpeech:
python run_downstream.py --weighted_sum --run=phone_linear --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
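For intuition about what a weighted sum over layers computes, here is a minimal sketch in plain Python for a single frame. It assumes the ELMo-style recipe of softmax-normalized scalar weights, one per layer; the toolkit's actual implementation may differ, and in a real model the raw weights would be a learnable PyTorch parameter operating on whole tensors. softmax and weighted_sum here are hypothetical helpers:

```python
import math


def softmax(weights):
    """Normalize raw layer weights so they are positive and sum to 1."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(layer_outputs, raw_weights):
    """Combine per-layer feature vectors of one frame into a single vector.

    layer_outputs: list of L vectors (one per layer), each of length D.
    raw_weights: list of L unnormalized scalars (learnable in a real model).
    """
    if len(layer_outputs) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    norm = softmax(raw_weights)
    dim = len(layer_outputs[0])
    return [sum(w * layer[d] for w, layer in zip(norm, layer_outputs))
            for d in range(dim)]


if __name__ == "__main__":
    layers = [[1.0, 2.0], [3.0, 4.0]]        # two layers, 2-dim features
    print(weighted_sum(layers, [0.0, 0.0]))  # equal weights give the layer average
```

With equal raw weights the result is the plain average of the layers; training shifts the weights toward whichever layers help the downstream task most.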
- Simply add --fine_tune to the above commands.
- For example, phone linear frame-wise classification on LibriSpeech:
python run_downstream.py --fine_tune --run=phone_linear --upstream=transformer --ckpt=path_to_ckpt/states-1000000.ckpt
- Simply change --upstream=transformer to --upstream=baseline, and we no longer need to specify --ckpt.
- For example, phone linear frame-wise classification on LibriSpeech:
python run_downstream.py --run=phone_linear --upstream=baseline
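To run several of the evaluations above in one go, the command lines can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this section; build_downstream_cmd and TASKS are hypothetical names, and the checkpoint path is the same placeholder used above:

```python
import shlex


def build_downstream_cmd(run, upstream="transformer", ckpt=None,
                         weighted_sum=False, fine_tune=False):
    """Return the argument list for one run_downstream.py evaluation."""
    if upstream == "transformer" and ckpt is None:
        raise ValueError("--ckpt is required when --upstream=transformer")
    cmd = ["python", "run_downstream.py", f"--run={run}", f"--upstream={upstream}"]
    if ckpt is not None:
        cmd.append(f"--ckpt={ckpt}")
    if weighted_sum:
        cmd.append("--weighted_sum")
    if fine_tune:
        cmd.append("--fine_tune")
    return cmd


TASKS = ["phone_linear", "phone_1hidden", "phone_concat",
         "speaker_frame", "speaker_utterance"]

if __name__ == "__main__":
    for task in TASKS:
        cmd = build_downstream_cmd(task, ckpt="path_to_ckpt/states-1000000.ckpt")
        # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it
        print(shlex.join(cmd))
```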
- See the supplementary wiki page for detailed instructions: ASR with PyTorch-Kaldi
- You can easily insert your own upstream models into the evaluation script run_downstream.py.
- There are only three simple requirements for each upstream model:
- Implements the forward method of nn.Module.
- Contains the out_dim attribute.
- Takes input and produces output in the shape: (batch_size, time_steps, feature_dim)
- Initialize your model in the function get_upstream_model in run_downstream.py:
elif args.upstream == 'your_model':
    example_options = {'ckpt_file' : args.ckpt,
                       'input_dim' : args.input_dim,
                       'load_pretrain' : True}
    upstream_model = YOUR_MODEL(example_options)
- Now you can evaluate your model with --upstream=your_model.
- Make sure the input acoustic features align with your pre-trained model.
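Before plugging a model into run_downstream.py, the three requirements listed above can be checked up front. This is a minimal sketch of such a check, not part of the toolkit: check_upstream is a hypothetical helper, it relies only on duck typing (anything with a .shape attribute works, for example a torch.Tensor), and it assumes that the last output dimension should equal out_dim:

```python
def check_upstream(model, example_input):
    """Verify the upstream-model interface and return the output shape.

    example_input: array-like with .shape == (batch_size, time_steps, feature_dim).
    """
    if not callable(getattr(model, "forward", None)):
        raise TypeError("upstream model must implement forward()")
    if not hasattr(model, "out_dim"):
        raise AttributeError("upstream model must expose an out_dim attribute")

    # Called directly for illustration; with an nn.Module you would call model(x)
    output = model.forward(example_input)
    in_shape, out_shape = tuple(example_input.shape), tuple(output.shape)

    if len(in_shape) != 3 or len(out_shape) != 3:
        raise ValueError(f"expected 3-D input and output, got {in_shape} and {out_shape}")
    if out_shape[0] != in_shape[0] or out_shape[2] != model.out_dim:
        raise ValueError(
            f"expected output shape ({in_shape[0]}, time_steps, {model.out_dim}), got {out_shape}")
    return out_shape
```

With PyTorch installed, a call such as check_upstream(your_model, torch.zeros(3, 1200, 40)) exercises the same check on a real tensor, where your_model and the 40-dim input are placeholders for your own setup.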
- You can also fine-tune or extract from the pre-trained upstream model on your own dataset and tasks!
- IMPORTANT: You must use input acoustic features with the same preprocessing settings and pipeline as pre-trained models!!!
- Pre-trained checkpoints can be downloaded from: S3PRL Drive
- Mockingjay Models: Download the data of libri_mel160_subword5000.zip, or follow the pipeline used in python preprocess/preprocess_libri.py --feature_type=mel to extract identical 160-dim mel features.
- TERA Models: Download the data of libri_fmllr_cmvn.zip, or follow the pipeline used in the Kaldi s5 recipe to extract identical 40-dim fmllr features.
- AALBERT Models: Coming soon, download the data of libri_fbank_cmvn.zip, or follow the pipeline used in the Kaldi s5 recipe to extract identical 80-dim fbank features.
- WARNING: If you are getting poor results, the most likely cause is a mismatch of acoustic features between the pre-trained model and the downstream task!!!
- Below is example code for fine-tuning an upstream model together with your own downstream model, using the wrapper class in nn_transformer.py:
import torch
from transformer.nn_transformer import TRANSFORMER
from downstream.model import example_classifier
from downstream.solver import get_optimizer
# setup the transformer model
"""
`options`: a python dictionary containing the following keys:
ckpt_file: str, a path specifying the pre-trained ckpt file
load_pretrain: str, ['True', 'False'], whether to load pre-trained weights
no_grad: str, ['True', 'False'], whether to have gradient flow over this class
dropout: float/str, use float to modify dropout value during downstream finetune, or use the str `default` for pre-train default values
spec_aug: str, ['True', 'False'], whether to apply SpecAugment on inputs (used for ASR training)
spec_aug_prev: str, ['True', 'False'], apply spec augment on input acoustic features if True, else apply on output representations (used for ASR training)
weighted_sum: str, ['True', 'False'], whether to use a learnable weighted sum to integrate hidden representations from all layers, if False then use the last layer
select_layer: int, select from all hidden representations, set to -1 to select the last (will only be used when weighted_sum is False)
permute_input: str, ['True', 'False'], this attribute is for the forward method. If True then input and output are in the shape of (T, B, D), if False then in (B, T, D)
"""
options = {
    'ckpt_file' : './result/result_transformer/tera/fmllrBase960-F-N-K-libri/states-1000000.ckpt',
    'load_pretrain' : 'True',
    'no_grad' : 'True',
    'dropout' : 'default',
    'spec_aug' : 'False',
    'spec_aug_prev' : 'True',
    'weighted_sum' : 'False',
    'select_layer' : -1,
    'permute_input' : 'False',
}
transformer = TRANSFORMER(options=options, inp_dim=0) # set `inp_dim=0` to auto load the `inp_dim` from `ckpt_file`
# setup your downstream class model
classifier = example_classifier(input_dim=768, hidden_dim=128, class_num=2).cuda()
# construct the optimizer
params = list(transformer.named_parameters()) + list(classifier.named_parameters())
optimizer = get_optimizer(params=params, lr=4e-3, warmup_proportion=0.7, training_steps=50000)
# forward
example_inputs = torch.zeros(3, 1200, 40) # A batch of spectrograms: (batch_size, time_step, feature_size)
# IMPORTANT: Input acoustic features must align with the ones used during our pre-training!
reps = transformer(example_inputs) # returns: (batch_size, time_step, feature_size)
labels = torch.LongTensor([0, 1, 0]).cuda()
loss = classifier(reps, labels)
# update
loss.backward()
optimizer.step()
# save
PATH_TO_SAVE_YOUR_MODEL = 'example.ckpt'
states = {'Classifier': classifier.state_dict(), 'Transformer': transformer.state_dict()}
# torch.save(states, PATH_TO_SAVE_YOUR_MODEL)
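The get_optimizer call above takes warmup_proportion and training_steps. As a rough guide to what such arguments usually mean, here is a minimal sketch of one common BERT-style linear warmup schedule: the learning rate rises linearly during the first warmup_proportion of training and then decays linearly to zero. This is an assumption for illustration only; the schedule actually used by get_optimizer may differ, and warmup_linear is a hypothetical helper:

```python
def warmup_linear(step, training_steps, warmup_proportion):
    """Learning-rate multiplier in [0, 1] for the given training step."""
    if not 0.0 < warmup_proportion < 1.0:
        raise ValueError("warmup_proportion must be strictly between 0 and 1")
    progress = step / training_steps
    if progress < warmup_proportion:
        return progress / warmup_proportion  # linear ramp-up from 0 to 1
    # linear decay from 1 at the end of warmup down to 0 at the final step
    return max((1.0 - progress) / (1.0 - warmup_proportion), 0.0)


if __name__ == "__main__":
    base_lr = 4e-3
    for step in (0, 17500, 35000, 50000):
        scale = warmup_linear(step, training_steps=50000, warmup_proportion=0.7)
        print(step, base_lr * scale)
```

Under this assumed schedule, warmup_proportion=0.7 with training_steps=50000 means the peak learning rate is reached at step 35000.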
For any arbitrary dataset that looks like this:
- Custom_dataset/
- Custom_train/
- *.wav / flac / mp3 ...
- Custom_dev/
- *.wav / flac / mp3 ...
- Custom_test/
- *.wav / flac / mp3 ...
The script preprocess/preprocess_any.py will process the "train", "dev", "test" set one by one:
python preprocess/preprocess_any.py --audio_extention=.flac
Users only need to specify the path of the directory of each set. So for the example above:
- the path to the "train" set should be:
Custom_dataset/Custom_train/
- the path to the "dev" set should be:
Custom_dataset/Custom_dev/
- the path to the "test" set should be:
Custom_dataset/Custom_test/
The generated files will be compatible with our dataloader.
Also, in your config file *.yaml, these should be changed:
data_path: 'data/NewData_fbank80'
train_set: ['train']
dev_set: ['dev']
test_set: ['test']
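Before running the preprocessing script, it can help to confirm that each set directory actually contains audio files with the extension you plan to pass. This is a minimal sketch using only the standard library; collect_audio is a hypothetical helper and is not how preprocess_any.py itself is implemented:

```python
from pathlib import Path


def collect_audio(set_dir, audio_extension=".flac"):
    """Return a sorted list of audio files found anywhere under set_dir."""
    set_dir = Path(set_dir)
    if not set_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {set_dir}")
    # rglob searches sub-directories too, in case files are nested per speaker
    return sorted(set_dir.rglob(f"*{audio_extension}"))


if __name__ == "__main__":
    sets = {
        "train": "Custom_dataset/Custom_train/",
        "dev": "Custom_dataset/Custom_dev/",
        "test": "Custom_dataset/Custom_test/",
    }
    for name, path in sets.items():
        print(name, len(collect_audio(path)), "files")
```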
- Create a personal fork of the main S3PRL repository in GitHub.
- Make your changes in a named branch different from master, e.g. you create a branch new-awesome-feature.
- Generate a pull request through the Web interface of GitHub.
- Please verify that your code is free of basic mistakes. We appreciate any contribution!
- Montreal Forced Aligner, McAuliffe et al.
- CMU MultimodalSDK, Amir Zadeh.
- PyTorch Transformers, Hugging Face.
- Autoregressive Predictive Coding, Yu-An Chung.
- Contrastive Predictive Coding, Aaron van den Oord.
- End-to-end ASR Pytorch, Alexander-H-Liu.
- Tacotron Preprocessing, Ryuichi Yamamoto (r9y9)
- PyTorch-Kaldi, Mirco Ravanelli
- Kaldi, Kaldi-ASR
- The S3PRL Toolkit:
@misc{S3PRL,
author = {Andy T. Liu and Yang Shu-wen},
title = {S3PRL: The Self-Supervised Speech Pre-training and Representation Learning Toolkit},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning}
}
Here we also list all papers that use our toolkit (Feel free to add your own paper by making a pull request).
- Mockingjay:
@article{mockingjay,
title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
ISBN={9781509066315},
url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
DOI={10.1109/icassp40776.2020.9054458},
journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
publisher={IEEE},
author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
year={2020},
month={May}
}
- TERA:
@misc{tera,
title={TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech},
author={Andy T. Liu and Shang-Wen Li and Hung-yi Lee},
year={2020},
eprint={2007.06028},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
- Mockingjay for Adversarial Defense, code for computing LNSR: utility/observe_lnsr.py
@misc{mockingjay_defense,
title={Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning},
author={Haibin Wu and Andy T. Liu and Hung-yi Lee},
year={2020},
eprint={2006.03214},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
- Understanding SAT:
@misc{understandingSAT,
title={Understanding Self-Attention of Self-Supervised Audio Transformers},
author={Shu-wen Yang and Andy T. Liu and Hung-yi Lee},
year={2020},
eprint={2006.03265},
archivePrefix={arXiv},
primaryClass={cs.CL}
}