Models and code for the INTERSPEECH 2023 paper *DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model*.
Model | Link |
---|---|
DistilXLSR128 | Google Drive |
DistilXLSR53 | Google Drive |
Language Models | Google Drive |
Our code is based on the fairseq toolkit. Copy the `codes` folder into `fairseq/fairseq/models`, rename it `distilxlsr` (or another name), and use DistilXLSR like other fairseq models such as Wav2vec 2.0 or HuBERT; a minimal setup sketch follows below. Please refer to the Wav2vec2 guideline for further information about the usage of Wav2vec 2.0.
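A minimal setup sketch, assuming fairseq is cloned locally; both paths below are placeholders for your own directories, not paths from this repo:

```python
# Copy this repo's `codes` folder into fairseq's model directory.
import shutil
from pathlib import Path

FAIRSEQ_ROOT = Path("/path/to/fairseq")    # placeholder: your fairseq clone
REPO_ROOT = Path("/path/to/DistilXLSR")    # placeholder: this repository

src = REPO_ROOT / "codes"
dst = FAIRSEQ_ROOT / "fairseq" / "models" / "distilxlsr"
shutil.copytree(src, dst)
print(f"Copied {src} -> {dst}")
```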
The selected 10-hour subsets of 5 languages in the Common Voice dataset (version 5.1) are provided in the `data` folder. Select the mp3 samples according to the `tsv` files, convert them to wav format, and save them in paths like `$output_path/$language/wav/$file_name`, e.g. `/mnt/data/el/wav/common_voice_el_20583960.wav`. Please remember to change the first line of the `tsv` files, which specifies the root folder of all the samples. A conversion sketch is shown below.
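A minimal conversion sketch, assuming the fairseq-style manifest layout described above (first line: root folder; following lines: tab-separated relative paths) and that ffmpeg is on the PATH. The `tsv` path below is a hypothetical example:

```python
import subprocess
from pathlib import Path

tsv_path = "data/el/train.tsv"        # hypothetical example path
output_path = Path("/mnt/data")       # $output_path from the text above
language = "el"

with open(tsv_path) as f:
    root = Path(f.readline().strip())           # first line: root folder
    for line in f:
        rel_mp3 = line.split("\t")[0].strip()   # relative mp3 path
        out_wav = output_path / language / "wav" / (Path(rel_mp3).stem + ".wav")
        out_wav.parent.mkdir(parents=True, exist_ok=True)
        # Convert to 16 kHz mono wav, the usual input for wav2vec2-style models.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(root / rel_mp3),
             "-ar", "16000", "-ac", "1", str(out_wav)],
            check=True,
        )
```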
Run `run_cv.sh` to fine-tune the DistilXLSR models on the 5 languages. Training takes about 5 hours on an RTX-3090 GPU.
You can download the language models from the link in the table above and unzip them. Run stage 1 in `decode.sh` to decode the models. The Sclite toolkit is used for scoring, so the transcription files must be formatted for Sclite; stage 2 in `decode.sh` does this (see the sketch after this paragraph). After scoring, the results are printed to the screen.
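For reference, Sclite's "trn" format puts one transcription per line followed by the utterance id in parentheses. The sketch below writes hypotheses in that format; the `hyps` mapping is hypothetical and should be adapted to your decode output:

```python
# Write hypotheses in Sclite's "trn" format: "<text> (<utt_id>)".
hyps = {
    "common_voice_el_20583960": "hypothesis text here",  # hypothetical entry
}

with open("hyp.trn", "w") as f:
    for utt_id, text in hyps.items():
        f.write(f"{text} ({utt_id})\n")

# Scoring then looks like:
#   sclite -r ref.trn trn -h hyp.trn trn -i rm -o all
```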
For comparison, we trained conformer-based E2E models and DNN-HMM (rather than GMM-HMM) models on the same 5 Common Voice languages, using the same subsets of no more than 10 hours per language.
Models | el | nl | eu | ia | pl | average |
---|---|---|---|---|---|---|
XLSR53 | 10.7 | 12.4 | 29.5 | 27.1 | 25.5 | 21.04 |
Proposed | 14.2 | 14.9 | 33.8 | 34.4 | 28.8 | 25.22 |
DNN-HMM | 43.4 | 10.26 | 25.77 | 71.71 | 21.48 | 34.524 |
E2E | 65.6 | 51.9 | 21.1 | 77.9 | 30.5 | 49.4 |
DistilXLSR models can also be used as feature extractors. The Python code below shows how to load a model and extract features.
```python
import torch
from fairseq.models.distilXLSR import DistilXLSR, DistilXLSRConfig

model_path = "path to the downloaded model checkpoint"
checkpoint = torch.load(model_path, map_location="cpu")

# Build the model from the stored config and load the distilled (student) weights.
pretrained_model_cfg = checkpoint["Config"]["model"]
pretrained_model_cfg = DistilXLSRConfig(pretrained_model_cfg)
model = DistilXLSR(pretrained_model_cfg)
model.load_state_dict(checkpoint["Student"])
model.eval()

data = torch.randn(1, 10000)  # (B, len_audio)
padding_mask = torch.zeros(1, 10000, dtype=torch.bool)  # True for padded samples

with torch.no_grad():
    (final_output, layer_results), padding_mask = model.forward(
        source=data,
        padding_mask=padding_mask,
        ret_layer_results=True,
    )

if model.encoder.layer_norm_first:
    # For layer_norm_first models, take the output of the first LayerNorm in
    # each transformer layer; drop the pre-encoder entry and append the final
    # encoder output (see the note below).
    layer_hiddens = [i[2] for i in layer_results]
    layer_hiddens.pop(0)
    layer_hiddens.append(final_output)
else:
    # For layer_norm_last models, simply take each transformer layer's output.
    layer_hiddens = [i[0] for i in layer_results]

x = layer_hiddens[-1]
print(x.shape)
```
Please note that for `layer_norm_first` models (XLSR-53 and XLSR-128) we use the outputs of the first LayerNorm module of each transformer layer as the output features; for the other models (i.e., `layer_norm_last` models such as Wav2vec 2.0 Base) we simply use the outputs of each transformer layer.
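For example, to use a single intermediate layer as downstream features (continuing the snippet above, and assuming the hidden states follow fairseq's `(T, B, C)` layout):

```python
layer_id = 6  # hypothetical layer choice
features = layer_hiddens[layer_id].transpose(0, 1)  # (T, B, C) -> (B, T, C)
print(features.shape)
```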